CORRELATION AND MACHINE CALCULATION


The rapid extension during recent years of the ideas of simple 
correlation has imposed their use upon many scientists not 
trained in the mathematical theory underlying them. The pres- 
ent trend in all biological sciences, as well as in economics and 
psychology, is still further to extend the use of correlation, 
broadening its scope to include the associations among more than 
two variables. One object of this bulletin is to present in simple, 
untechnical language some explanation of the meaning and uses 
of the various correlation coefficients, simple, partial and mul- 
tiple. 

The second and principal object of the bulletin is to set forth 
explicit directions for the use of the usual commercial forms of 
calculating machines, either key-driven, such as the Comptometer 
and Burroughs Calculator, or crank driven, such as the Monroe 
or Marchant, in finding correlation coefficients or related con- 
stants. According to the usual procedure, where the arithmetic is 
done mentally, the use of the correlation table, or double entry 
table, is almost indispensable. The advent and prevalent use of 
calculating machines, however, make practicable a return to 
simpler and more direct methods of reckoning. These machines 
are admirably adapted to the calculation of all the correlation 
constants with speed and precision. 

For extensive data where the number of observations runs into 
the thousands, punched cards should be used with sorting and 
tabulating machines, such as the Hollerith machines. The average
research worker, however, who is dealing with fewer than 500
cases, will probably find the methods herein set forth well
adapted to his use. 

For the benefit of those readers who are not familiar with the 
ideas of simple correlation between two variables, we shall pre- 
sent them very briefly in the following paragraphs. 


PART I. SIMPLE CORRELATION. 


A simple correlation coefficient, r, between two variables is
a measure of the degree to which they tend to be associated or to
move together. If they should move in the same direction, keeping
perfect step all the way, r is so designed as to take the value
+1; if on the contrary they should move in exactly opposite
directions, but at the same proportional rates, r then assumes the
value -1. In actual statistical work, such perfect associations
do not occur, and r will usually be found to lie somewhere between
-.95 and +.95. As an example, if we take the size of the
corn crop and the price of corn per bushel year by year for a
number of years back, we find that r = -.78. This is the numerical
measure, furnished by the methods of correlation, of the
very real tendency of large corn crops to be associated with low
prices, and vice versa.

It is interesting to observe why the letter r is universally used
to designate a correlation coefficient. Sir Francis Galton, who
first used the idea of correlation, as here presented, back in the
early 1880’s, was working on the problem of the degree to which
children inherited height from their parents. He looked upon the
tendency of children to resemble their parents only partly, while
partly reverting to racial characteristics, as a “regression” of
the inherited characteristics upon those of the parents. Originally,
therefore, r stood for regression or reversion.


TABLE 1. CORN YIELD AND LAND VALUE


Observation                 Average corn yield,    Average land value
number        County        bushels per acre,      per acre,
                            1910-1919              Jan. 1, 1920
                                 A                      X
     1        Allamakee          40                   $ 87
     2        Bremer             36                    133
     3        Butler             34                    174
     4        Calhoun            41                    285
     5        Carroll            39                    263
     6        Cherokee           42                    274
     7        Dallas             40                    235
     8        Davis              31                    104
     9        Fayette            36                    141
    10        Fremont            34                    208
    11        Howard             30                    115
    12        Ida                40                    271
    13        Jefferson          37                    163
    14        Johnson            41                    193
    15        Kossuth            38                    203
    16        Lyon               38                    279
    17        Madison            34                    179
    18        Marshall           45                    244
    19        Monona             34                    165
    20        Pocahontas         40                    257
    21        Polk               41                    252
    22        Story              42                    280
    23        Wapello            35                    167
    24        Warren             33                    168
    25        Winneshiek         36                    115

In fundamental research in the biological, agricultural, economic
and educational fields, correlation is often of the greatest
service in demonstrating the value of hypotheses already tentatively
accepted, and in suggesting new hypotheses for verification.

Further explanation and discussion will be made in connection
with the data of Table 1, consisting of paired observations
on the corn yield in bushels per acre (average for years
1910-1919), and land value per acre (Jan. 1, 1920) in 25 Iowa
counties.

The numbers giving corn yield will be designated by the symbol
A; those giving land value, by X. The value X is to be
thought of as “dependent” upon the yield, A. Hence X is to be
considered as the “criterion” or “dependent variable,” while A
is called the “independent variable”.

We shall first explain in detail the procedure of calculation, 
summarizing the results and formulas in Table 2, and shall then 
discuss the meaning of the results obtained. 

First. Add each column of observed values on the machine,
designating the sum of the A-values by ΣA (Σ is the Greek equivalent
of S. The symbol ΣA is read either “Sigma A” or, better,
“sum of the A’s.”) and the sum of the X-values by ΣX; that is

$$\Sigma A = 40 + 36 + 34 + \text{etc.} = 937$$
$$\Sigma X = 87 + 133 + 174 + \text{etc.} = 4{,}955$$


Second. Using the machine, divide each sum by the number
of observations, 25, the results being the arithmetic means (averages)
of the variables. In the formulas, the number of observations
is denoted by n. If we designate the means by M_A and
M_X respectively, we then have,

$$M_A = \frac{\Sigma A}{n} = \frac{937}{25} = 37.48 \text{ bu. per acre}$$
$$M_X = \frac{\Sigma X}{n} = \frac{4955}{25} = \$198.20 \text{ per acre}$$


Third. Calculate the sum of the squares of the individual
A’s, thus,

$$\Sigma A^2 = (40)^2 + (36)^2 + (34)^2 + \text{etc.} = 35{,}461$$

The individual values should be squared on the machine
(40 × 40), (36 × 36), etc., and the sum carried through the entire
twenty-five operations without clearing the machine. Thus,
no record is made of the individual squares, but only of their
sum. Similarly,

$$\Sigma X^2 = (87)^2 + (133)^2 + (174)^2 + \text{etc.} = 1{,}075{,}817$$


Fourth. Calculate the sum of the products of the pairs of
values:

$$\Sigma AX = (40 \times 87) + (36 \times 133) + (34 \times 174) + \text{etc.} = 189{,}533$$

These products are also carried without clearing the machine, so
that the total in the machine at the end of the twenty-five
multiplications is the required “product-moment”.
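The four operations above can be imitated in a few lines of Python. This is only a sketch for readers who wish to check the totals: the function name `machine_sums` and the dictionary keys are illustrative assumptions, and the single loop stands in for carrying a running total on an uncleared machine.

```python
# Table 1 data: corn yield A (bu. per acre) and land value X ($ per acre).
A = [40, 36, 34, 41, 39, 42, 40, 31, 36, 34, 30, 40, 37,
     41, 38, 38, 34, 45, 34, 40, 41, 42, 35, 33, 36]
X = [87, 133, 174, 285, 263, 274, 235, 104, 141, 208, 115, 271, 163,
     193, 203, 279, 179, 244, 165, 257, 252, 280, 167, 168, 115]


def machine_sums(a_values, x_values):
    """Accumulate the five running totals in one pass over the pairs.

    No individual square or product is recorded, only the totals,
    which mirrors the uncleared-machine procedure described above.
    """
    totals = {"sum_a": 0, "sum_x": 0, "sum_a2": 0, "sum_x2": 0, "sum_ax": 0}
    for a, x in zip(a_values, x_values):
        totals["sum_a"] += a
        totals["sum_x"] += x
        totals["sum_a2"] += a * a
        totals["sum_x2"] += x * x
        totals["sum_ax"] += a * x
    return totals


totals = machine_sums(A, X)
n = len(A)
mean_a = totals["sum_a"] / n   # M_A
mean_x = totals["sum_x"] / n   # M_X
print(totals)
print(mean_a, mean_x)
```

The totals agree with those quoted in the text (937; 4,955; 35,461; 1,075,817; 189,533).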

A word of explanation should be interpolated as to the number
of decimal places carried in this illustrative example. Since
M_A = 37.48 bu. ± .50 bu., only the first decimal place has
statistical significance. M_X = $198.20 ± $8.26, so that the number of
cents need not be carried at all. (The reliability of r will be
discussed later.) The arithmetical work has usually been carried to
four decimal places, however, so that the reader may verify the
operations without confusion. This is merely for convenience
and uniformity and is not intended to denote statistical
significance.

The results of the four operations just described (Lines 1, 2 
and 3, Table 2) constitute the data for the calculation of the 
correlation coefficient between A and X. The formula used is a 
form of the ordinary product-moment formula, as follows: 


$$r = \frac{\Sigma AX - (\Sigma A) M_X}{\sqrt{\Sigma A^2 - (\Sigma A) M_A} \times \sqrt{\Sigma X^2 - (\Sigma X) M_X}}$$


This may be read in words, “The correlation coefficient between
X and A is given by a fraction whose numerator is the difference
between ΣAX and the product of ΣA by M_X. The denominator
is the product of two square roots, the first being the
square root of the difference between ΣA² and the product of
ΣA by M_A; the second, the square root of the difference between
ΣX² and the product of ΣX by M_X.”

This is almost the same form for r as that given in Rietz’s
“Handbook of Mathematical Statistics”, page 122. It is in no
sense an approximation, but is derived by ordinary algebraic 
processes from the more usual forms. 
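
For the reader who wishes to see the algebra, the following is a brief sketch of how the machine form may be reached. It assumes that the "more usual form" meant is the familiar one written in deviations from the means; that starting point is an assumption, the symbols being those already defined.

```latex
% Usual deviation form (assumed starting point):
r = \frac{\Sigma (A - M_A)(X - M_X)}
         {\sqrt{\Sigma (A - M_A)^2}\,\sqrt{\Sigma (X - M_X)^2}}

% Numerator, using \Sigma X = n M_X:
\Sigma (A - M_A)(X - M_X)
  = \Sigma AX - M_X \Sigma A - M_A \Sigma X + n M_A M_X
  = \Sigma AX - (\Sigma A) M_X

% Sum of squares, using \Sigma A = n M_A (and likewise for X):
\Sigma (A - M_A)^2
  = \Sigma A^2 - 2 M_A \Sigma A + n M_A^2
  = \Sigma A^2 - (\Sigma A) M_A
```

Substituting these three results into the deviation form gives the formula used in this bulletin, with no approximation at any step.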

The calculation is now completed in the following steps,
beginning (where we left off) with the

Fifth. Calculate the three products,

$$(\Sigma A) M_A = 937 \times 37.48 = 35{,}119$$
$$(\Sigma X) M_X = 4{,}955 \times 198.20 = 982{,}081$$
$$(\Sigma A) M_X = 937 \times 198.20 = 185{,}713$$


Enter these results as indicated in Line 4, Table 2.


TABLE 2. SUMMARY OF FORMULAS AND CALCULATIONS


Sixth. Subtract the numbers in Line 4 from those in Line 3 as
indicated in Table 2.

Seventh. Extract the square roots of
the first two results just obtained; thus,

$$\sqrt{\Sigma A^2 - (\Sigma A) M_A} = \sqrt{342} = 18.49$$
$$\sqrt{\Sigma X^2 - (\Sigma X) M_X} = \sqrt{93{,}736} = 306.16$$


Square roots may be calculated on the 
machine or by the usual arithmetic 
methods, but a table of squares and 
square roots (such as Barlow’s) gives 
the results much more rapidly. 

Eighth. Multiply the two results just
obtained to get the denominator of the
fraction in the formula for r;

$$\sqrt{\Sigma A^2 - (\Sigma A) M_A} \times \sqrt{\Sigma X^2 - (\Sigma X) M_X} = 18.49 \times 306.16 = 5{,}660.9$$


Ninth. To obtain r, divide as indicated
in the formula;

$$r = \frac{\Sigma AX - (\Sigma A) M_X}{5{,}660.9} = .6747$$

thus completing the calculation of r.
However, for later use we shall add to
Table 2 the following:

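Tenth. Compute the standard deviation
of the A’s thus,

$$\sigma_A = \frac{\sqrt{\Sigma A^2 - (\Sigma A) M_A}}{\sqrt{n}} = \frac{18.49}{5} = 3.70 \text{ bu. per acre}$$

Also, for the standard deviation of the
X’s,

$$\sigma_X = \frac{\sqrt{\Sigma X^2 - (\Sigma X) M_X}}{\sqrt{n}} = \frac{306.16}{5} = \$61.23 \text{ per acre}$$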
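
Steps five to ten can be followed in Python from the five totals alone. The sketch below assumes the totals quoted in the text; the variable names are illustrative. Because it carries full precision instead of rounding each intermediate result, its r comes out a few units lower in the fourth decimal place than the .6747 obtained above.

```python
import math

# Totals and number of observations quoted in the text (Table 1 data).
n = 25
sum_a, sum_x = 937, 4955
sum_a2, sum_x2, sum_ax = 35461, 1075817, 189533

mean_a = sum_a / n   # M_A
mean_x = sum_x / n   # M_X

# Fifth and sixth steps: subtract the products from the raw totals.
ss_a = sum_a2 - sum_a * mean_a    # Sigma A^2 - (Sigma A) M_A
ss_x = sum_x2 - sum_x * mean_x    # Sigma X^2 - (Sigma X) M_X
sp_ax = sum_ax - sum_a * mean_x   # Sigma AX  - (Sigma A) M_X

# Seventh to ninth steps: square roots, denominator, and r.
root_a = math.sqrt(ss_a)
root_x = math.sqrt(ss_x)
r = sp_ax / (root_a * root_x)

# Tenth step: standard deviations (divisor sqrt(n), as in the text).
sigma_a = root_a / math.sqrt(n)
sigma_x = root_x / math.sqrt(n)

print(round(r, 4), round(sigma_a, 2), round(sigma_x, 2))
```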







For the benefit of the novice in the use of a calculating ma- 
chine, it is suggested that sub-totals be recorded frequently when 
a long column is being added, especially if multiplications are 
being done at the same time. This helps in checking the results. 
An experienced operator will check his work 49 times out of 50 
the second time over. The following gives some idea of the speed
that may be maintained by fairly proficient operators in carrying
through the various operations in the problem just completed:

APPROXIMATE TIME OF CALCULATIONS


                 Time of operations in seconds           (heading      Time in min.
Machine          ΣA     ΣX            ΣA²    ΣX²    ΣAX  illegible)    of calculations
Key driven       15     20            45     100    90   (illegible)   (illegible)
Crank driven     45     (illegible)   115    145    140  12            10




















It should be distinctly understood that these figures are given 
merely for the guidance of the novice, and have little or no bear- 
ing on the relative merits of key and crank driven machines. 
Each type of machine has peculiar advantages, and the type to 
be used in any given office depends upon many circumstances be- 
sides the speed attained in the calculation of this particular 
problem. 

In the following sections will be set forth the meaning and
uses of simple correlation coefficients, using the one just calculated
as an illustration. The beginner is warned not to attempt a too
literal interpretation. Although perfect correlation is measured
by 1.00, the r = .67 (we shall carry only the first two decimal
places in this discussion) cannot be thought of as a percent.
There is no absolute scale on which we can say that one correlation
is high and another low.


RELIABILITY OF THE CORRELATION COEFFICIENT

As is the case with all statistical constants, the reliability of a
correlation coefficient is indicated by the smallness of its standard
deviation. Denoting the standard deviation of r by the
symbol σ_r, the formula is,

$$\sigma_r = \frac{1 - r^2}{\sqrt{n}}$$

In our example

$$\sigma_r = \frac{1 - (.67)^2}{\sqrt{25}} = .11$$


To interpret this in connection with our r of .67, we first calculate
the range from (.67 - .11) to (.67 + .11); that is, the
range from .56 to .78. We then say (from theoretical considerations)
that if these data were collected over and over from similarly
located counties the chances are that about 68% of the
resulting r’s would lie between .56 and .78. Of the other 32%,
about half would lie below .56 and the remainder above .78.

If the reader is more familiar with the idea of “probable
deviation” (probable error) then he may use the formula,

$$E_r = .6745 \, \frac{1 - r^2}{\sqrt{n}}$$

which gives a probable deviation of .07 in our example. The
corresponding range is now from .60 to .74 (.67 ± .07) and the
interpretation is that in future experiments similarly conducted
we may expect about 50% of the resulting r’s to lie within
this range.

It is now evident that only the first two decimal places in r 
have statistical significance. As will appear later, the arithmetical 
operations are standardized by carrying the calculations to four 
places of decimals, but this is done merely for convenience in 
verifying the results. For an excellent short statement as to the
number of significant figures, see Truman L. Kelley, “How Many
Figures are Significant?” in Science, Vol. LX, No. 1562, page
524, Dec. 5, 1924.

It is perhaps simpler to calculate a range within which all
r’s would be likely to lie. While certainty is unattainable, we
may say that a range of twice the standard deviation will usually
contain above 95% of similarly obtained r’s, while a range of
three times the standard deviation will probably contain more
than 99% of them. The first of these ranges is sufficient for
ordinary practical work, while the second would be accepted for
most scientific work. In our example, 2σ_r = 2 × .11 = .22. We
may therefore reasonably expect 95% of similarly obtained r’s
to lie between .45 and .89 (.67 ± .22). Since 3σ_r = .33, the
range .67 ± .33 (from .34 to 1.00) will probably contain all
similarly calculated r’s.
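
The ranges just described can be reproduced with a short Python sketch. The function names are illustrative assumptions, and the capping of each range at -1 and +1 is a convenience added here because r cannot exceed those limits; it is not part of the formula in the text.

```python
import math


def r_standard_error(r, n):
    """Standard deviation of r by the text's formula (1 - r^2) / sqrt(n)."""
    return (1 - r * r) / math.sqrt(n)


def r_range(r, n, multiple):
    """Range r -/+ multiple * sigma_r, capped at the limits -1 and +1."""
    half_width = multiple * r_standard_error(r, n)
    return max(-1.0, r - half_width), min(1.0, r + half_width)


r, n = 0.67, 25
sigma_r = r_standard_error(r, n)
probable_error = 0.6745 * sigma_r   # the "probable deviation" of r

print(f"sigma_r = {sigma_r:.2f}, probable error = {probable_error:.2f}")
for multiple in (1, 2, 3):
    low, high = r_range(r, n, multiple)
    print(f"{multiple} x sigma_r: {low:.2f} to {high:.2f}")
```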

It is now easily understood why reliability is measured by the
smallness of the standard deviation. The smaller the range necessary
to include 95% of all similarly calculated r’s, the more
likely it is that such r’s will closely approximate the one already
obtained. Study of the formula for σ_r will reveal that two elements
enter into the determination of its smallness; first, the
largeness of r itself, and second, the largeness of the number of
observations. If, for example, our r had been .77 instead of
.67, its standard deviation would have been

$$\sigma_r = \frac{1 - (.77)^2}{\sqrt{25}} = .08$$

and the smaller range from .61 to .93 (.77 ± 2 × .08) would
be likely to embrace 95% of such r’s. On the other hand, if
the original r = .67 had been obtained from 100 observations
instead of 25, the corresponding standard deviation would
have been

$$\sigma_r = \frac{1 - (.67)^2}{\sqrt{100}} = .055$$

just half of the actual value in the given example. The
correspondingly smaller range from .56 to .78 (.67 ± 2 × .055)
would then contain 95% of such r’s.

The student who will experiment with the formula, testing 
for reliability r’s of various sizes and depending upon differ- 
ent numbers of observations, will soon gain a real appreciation 
of the way in which their reliability depends upon these elements. 
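
The experiment suggested above is easily made by machine of another kind. The following Python sketch tabulates the standard deviation of r for several sizes of r and several numbers of observations; the particular values of r and n chosen are illustrative assumptions only.

```python
import math


def r_standard_error(r, n):
    """Standard deviation of r by the formula (1 - r^2) / sqrt(n)."""
    return (1 - r * r) / math.sqrt(n)


r_values = (0.27, 0.47, 0.67, 0.77, 0.87)   # illustrative sizes of r
n_values = (25, 100, 400)                   # illustrative numbers of observations

# One row per r, one column per n.
print("   r  " + "".join(f"  n={n:<5d}" for n in n_values))
for r in r_values:
    cells = "".join(f"  {r_standard_error(r, n):7.3f}" for n in n_values)
    print(f"{r:5.2f} {cells}")
```

Reading across a row shows the effect of quadrupling the number of observations, which halves the standard deviation; reading down a column shows the effect of a larger r.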


THE REGRESSION EQUATION 


The most practical use of r is in the calculation of the equation
of the “regression line”, whose meaning and use will now
be discussed. Fig. 1 shows the familiar dot diagram, or scatter
diagram, of the data of Table 1. One dot, properly located,
represents each pair of values in the table. The fact that A and
X are correlated is shown qualitatively on this diagram by the
distribution of the dots in a band and not merely at random. The
regression line shows the trend of this band of dots. It represents
the best average position of the dots that statistical study
is able to furnish. The line is plotted on the diagram and the
estimated land values of Table 3 are calculated by means of the
“regression equation” (in which the symbol X̄ is read “estimated
value of X”),

$$\bar{X} = M_X + r \, \frac{\sigma_X}{\sigma_A} \, (A - M_A)$$

FIGURE 1. THE REGRESSION LINE

If we substitute in this formula the values computed in our
example, we have,

$$\bar{X} = 198.20 + .6747 \times \frac{61.23}{3.70} \, (A - 37.48),$$

or, performing indicated arithmetical operations,

$$\bar{X} = 11.17A - 220.45$$

This means that for a particular corn yield, say A = 38 bu. per
acre, the corresponding estimated land value will be 11.17 × 38
- 220.45 = $204.01 per acre.

Continuing the calculations as above of estimated values from 
actual corn yields, we have the values appearing in Table 3. 
(The number of cents is not recorded as it has no statistical
significance.)
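
The arithmetic of the regression equation can be sketched in Python as follows. The function name `regression_line` is an illustrative assumption. To agree with the figures in the text, the sketch rounds the slope to two decimal places before the constant term is found, as the text appears to do; carrying the unrounded slope would alter the constant slightly.

```python
def regression_line(mean_x, mean_a, r, sigma_x, sigma_a, slope_places=2):
    """Return (slope, intercept) of the regression of X on A.

    The slope r * sigma_x / sigma_a is rounded to `slope_places`
    decimals before the intercept is computed (an assumption made
    here so that the result matches the worked example).
    """
    slope = round(r * sigma_x / sigma_a, slope_places)
    intercept = mean_x - slope * mean_a
    return slope, intercept


# Constants from the worked example.
slope, intercept = regression_line(198.20, 37.48, 0.6747, 61.23, 3.70)

# Estimated land value for a corn yield of 38 bu. per acre.
estimate_38 = slope * 38 + intercept
print(slope, round(intercept, 2), round(estimate_38, 2))
```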

TABLE 3. LAND VALUE ESTIMATED FROM CORN YIELD


                      Actual average     Estimated
                      land value         land value     Errors of
Observation           per acre,          per acre       estimate
                      Jan. 1, 1920
 1. Allamakee             $ 87             $226           -139
 2. Bremer                 133              182           - 49
 3. Butler                 174              159             15
 4. Calhoun                285              238             47
 5. Carroll                263              215             48
 6. Cherokee               274              249             25
 7. Dallas                 235              226              9
 8. Davis                  104              126           - 22
 9. Fayette                141              182           - 41
10. Fremont                208              159             49
11. Howard                 115              115              0
12. Ida                    271              226             45
13. Jefferson              163              193           - 30
14. Johnson                193              238           - 45
15. Kossuth                203              204           -  1
16. Lyon                   279              204             75
17. Madison                179              159             20
18. Marshall               244              282           - 38
19. Monona                 165              159              6
20. Pocahontas             257              226             31
21. Polk                   252              238             14
22. Story                  280              249             31
23. Wapello                167              170           -  3
24. Warren                 168              148             20
25. Winneshiek             115              182           - 67


There are two counties, Kossuth and Lyon, whose average corn
yield is 38 bu. per acre. It will be observed that the corresponding
estimated land value of approximately $204 per acre agrees
closely with the actual land value of Kossuth county, but is $75
too low for Lyon county. The estimated land value is a kind
of average value (not the arithmetic mean) corresponding to
this particular figure for corn yield, but taking account of the
peculiarities not only of these two counties, but also of all the
twenty-five counties. To the student of statistics, the case of
Lyon county is the more interesting and important. He immediately
asks, “What peculiarity has this county that makes its
land value diverge so greatly from the estimated or average
value?” This is exactly the question whose answer must be
found in the later chapters on multiple correlation. Corn yield
is only one of the many characteristics entering into the
determination of land value. An examination of Table 6 will show
that whereas Lyon county has close to the average corn yield,
its percentages of farm land in corn and small grain, and its
number of brood sows per thousand acres are all much higher
than average. Multiple correlation is a scheme for taking into
account associations of all these elements with land value.

We now come to the real problem in any statistical study: how
to interpret the results. As indicated above, one of the most
fruitful sources of information is the study of the cases in which
the estimated values diverge most widely from the actual land
values. It should be noticed that these “errors of estimate” are
positive or negative according as the estimated value falls short
of the actual value, or exceeds it. We have just discussed the
largest of the positive errors of estimated land value, that of
Lyon county; let us now study the largest of the negative errors.
Allamakee county has a land value of $139 less than that predicted
on the basis of corn yield alone. Referring again to Table
6, we find that although this county has a high corn yield per
acre, its percentages of land in corn and small grains are among
the lowest of the 25 counties. It is obvious that we shall have
to include one or both of these elements in our study.

In order to help the reader to get a correct concept of the 
kind of averages these estimated values are, we call his attention 
to the following facts: 

1. The sum of the positive errors of estimate (actual values
minus estimated values) is equal to the sum of the negative
errors; that is, the algebraic sum of all the errors is zero. Hence,
while any one of these estimated (average) values may deviate
considerably from the actual, they are so adjusted that the
positive and negative deviations exactly balance.

2. The sum of the squares of these errors of estimate is less
than it would be if any other linear regression equation had
been used. Since the “standard error of estimate” (to be explained
later) is calculated directly from such sum, it follows
that the standard error of estimate is less than the root-mean-square
average of the errors given by any other straight line.

All this means that the regression line is drawn through the
dots in such a way that the algebraic sum of the vertical
distances of all the dots from the line is zero, and the sum of the
squares of such distances is a minimum.
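
Both properties can be checked numerically on the data of Table 1. The Python sketch below fits the line at full precision and then compares it with a handful of neighbouring lines; the amounts by which the slope and constant are shifted are arbitrary illustrative choices, and a few trial lines are of course a demonstration, not a proof.

```python
# Table 1 data: corn yield A and land value X.
A = [40, 36, 34, 41, 39, 42, 40, 31, 36, 34, 30, 40, 37,
     41, 38, 38, 34, 45, 34, 40, 41, 42, 35, 33, 36]
X = [87, 133, 174, 285, 263, 274, 235, 104, 141, 208, 115, 271, 163,
     193, 203, 279, 179, 244, 165, 257, 252, 280, 167, 168, 115]

n = len(A)
mean_a = sum(A) / n
mean_x = sum(X) / n

# Slope of the regression of X on A; algebraically equal to r * sigma_X / sigma_A.
slope = (sum((a - mean_a) * (x - mean_x) for a, x in zip(A, X))
         / sum((a - mean_a) ** 2 for a in A))
intercept = mean_x - slope * mean_a


def errors(line_slope, line_intercept):
    """Errors of estimate: actual minus estimated land value."""
    return [x - (line_slope * a + line_intercept) for a, x in zip(A, X)]


def sum_of_squares(line_slope, line_intercept):
    return sum(e * e for e in errors(line_slope, line_intercept))


# Property 1: the algebraic sum of the errors is zero (up to rounding).
error_sum = sum(errors(slope, intercept))

# Property 2: every shifted trial line gives a larger sum of squared errors.
best = sum_of_squares(slope, intercept)
trial_lines = [(slope + ds, intercept + di)
               for ds in (-0.5, 0.0, 0.5) for di in (-5.0, 0.0, 5.0)
               if (ds, di) != (0.0, 0.0)]
all_worse = all(sum_of_squares(s, i) > best for s, i in trial_lines)

print(round(error_sum, 6), all_worse)
```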

An interesting interpretation of the meaning of r can now 
be introduced. It is not only the correlation coefficient between 
corn yield and land value, but is also the correlation coefficient 
between actual land value and estimated land value. That is, 
if r were calculated for the two columns of land values in Table 
3, its value would be .67. It is therefore in a very definite sense 
a direct measure of our success in estimation. 
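
This interpretation may be verified directly. The following Python sketch computes r once between corn yield and land value, and once between actual and estimated land value; the helper name `correlation` is an illustrative assumption, and the estimates are computed at full precision, not taken from the rounded figures of Table 3.

```python
import math

# Table 1 data: corn yield A and land value X.
A = [40, 36, 34, 41, 39, 42, 40, 31, 36, 34, 30, 40, 37,
     41, 38, 38, 34, 45, 34, 40, 41, 42, 35, 33, 36]
X = [87, 133, 174, 285, 263, 274, 235, 104, 141, 208, 115, 271, 163,
     193, 203, 279, 179, 244, 165, 257, 252, 280, 167, 168, 115]


def correlation(u, v):
    """Product-moment correlation, written in deviations from the means."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    sp = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return sp / math.sqrt(ss_u * ss_v)


n = len(A)
mean_a = sum(A) / n
mean_x = sum(X) / n
slope = (sum((a - mean_a) * (x - mean_x) for a, x in zip(A, X))
         / sum((a - mean_a) ** 2 for a in A))
estimated = [mean_x + slope * (a - mean_a) for a in A]

r_yield_value = correlation(A, X)             # corn yield vs. land value
r_actual_estimated = correlation(X, estimated)  # actual vs. estimated land value
print(round(r_yield_value, 4), round(r_actual_estimated, 4))
```

The two coefficients agree because the estimated values are a straight-line function of the corn yields with a positive slope.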

It should be clearly understood that there are two regression
lines in simple correlation. The one just discussed is known as
the regression of land value on corn yield; that is, the regression
of X on A. If we should wish to estimate average corn yield
from given land values we should have to use the formula for
regression of A on X, as follows:

$$\bar{A} = M_A + r \, \frac{\sigma_A}{\sigma_X} \, (X - M_X)$$

Using the values computed in our example, this reduces to

$$\bar{A} = .0408X + 29.4$$

This represents a different line from that shown in Fig. 1; and
any particular land value together with the estimated yield as
computed from this last formula constitute a different pair of
values from those found in Table 3. Only in the case of perfect
correlation (r = 1) would there be perfect agreement between
the pairs of values calculated from the two regression formulas.
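
The difference between the two lines can be seen in a short Python sketch. The function name is an illustrative assumption, and the constants are those of the worked example. The product of the two slopes equals r squared, so the lines can coincide only when that product is 1.

```python
def two_regression_lines(mean_a, mean_x, sigma_a, sigma_x, r):
    """Return ((slope, intercept) of X on A, (slope, intercept) of A on X)."""
    slope_x_on_a = r * sigma_x / sigma_a
    slope_a_on_x = r * sigma_a / sigma_x
    x_on_a = (slope_x_on_a, mean_x - slope_x_on_a * mean_a)
    a_on_x = (slope_a_on_x, mean_a - slope_a_on_x * mean_x)
    return x_on_a, a_on_x


r = 0.6747
x_on_a, a_on_x = two_regression_lines(37.48, 198.20, 3.70, 61.23, r)

# Two lines in the same diagram are identical only if one slope is the
# reciprocal of the other, i.e. only if the product of the slopes is 1.
slope_product = x_on_a[0] * a_on_x[0]

print("X on A:", round(x_on_a[0], 2), round(x_on_a[1], 2))
print("A on X:", round(a_on_x[0], 4), round(a_on_x[1], 1))
print("product of slopes:", round(slope_product, 4), " r squared:", round(r * r, 4))
```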


THE STANDARD ERROR OF ESTIMATE 


The second practical use of r is to enable us to calculate easily 
an average of the differences between the actual and estimated 
land values; that is, an average of the errors of estimate given 
in the last column of Table 3. As already indicated, the mean 
of these errors of estimate is zero, because their algebraic sum 
is zero. The average which is generally used is the root-mean- 
square average known as the ‘‘standard error of estimate’’. It 
might be obtained by squaring each of the numbers in this last 
column of Table 3, adding such squares, dividing the sum by 25 
(the number of observations) and extracting the square root of 
the quotient. Practically, however, the same result (which we
shall designate by σ_X.A) is obtained by the use of the formula

$$\sigma_{X.A} = \sigma_X \sqrt{1 - r^2}$$

Substituting our values, we find that the standard error of
estimated land values is

$$\sigma_{X.A} = 61.23 \sqrt{1 - (.6747)^2} = 61.23 \times .738 = \$45.19 \text{ per acre}$$

It is to be observed that this standard error of estimate is
73.8% of the standard deviation, σ_X. In other words, the standard
deviation of predicted values from actual values is only
73.8% of the standard deviation of actual values from their
mean.

Since this standard error of estimate is the standard deviation 
of the differences between actual and estimated land values, it 
has the usual interpretation of standard deviations (see page 
11); that is, about 68% of the errors lie in the range from 
— $45.19 to + $45.19. Furthermore, approximately 95% of 
the errors are expected to lie in the range from — $90.38 to 
+ $90.38; and usually all the errors are included in the range 
from — $135.57 to + $135.57. An examination of the actual er- 
rors will convince the reader of the close agreement of these 
theoretically computed ranges with the facts. 
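
The agreement between the two ways of finding the standard error of estimate, and the count of errors falling within each range, may be checked with the following Python sketch. It works at full precision from the data of Table 1, so its figures differ in the later decimals from the rounded ones quoted above; the variable names are illustrative.

```python
import math

# Table 1 data: corn yield A and land value X.
A = [40, 36, 34, 41, 39, 42, 40, 31, 36, 34, 30, 40, 37,
     41, 38, 38, 34, 45, 34, 40, 41, 42, 35, 33, 36]
X = [87, 133, 174, 285, 263, 274, 235, 104, 141, 208, 115, 271, 163,
     193, 203, 279, 179, 244, 165, 257, 252, 280, 167, 168, 115]

n = len(A)
mean_a = sum(A) / n
mean_x = sum(X) / n
ss_a = sum((a - mean_a) ** 2 for a in A)
ss_x = sum((x - mean_x) ** 2 for x in X)
sp_ax = sum((a - mean_a) * (x - mean_x) for a, x in zip(A, X))

r = sp_ax / math.sqrt(ss_a * ss_x)
sigma_x = math.sqrt(ss_x / n)
slope = sp_ax / ss_a
intercept = mean_x - slope * mean_a

# Errors of estimate: actual minus estimated land value.
errors = [x - (slope * a + intercept) for a, x in zip(A, X)]

# Direct method: root-mean-square of the errors (divisor n, as in the text).
direct = math.sqrt(sum(e * e for e in errors) / n)
# Short method: sigma_X * sqrt(1 - r^2).
by_formula = sigma_x * math.sqrt(1 - r * r)

# Number of errors within one, two and three standard errors of estimate.
within = {k: sum(1 for e in errors if abs(e) <= k * direct) for k in (1, 2, 3)}
print(round(direct, 2), round(by_formula, 2), within)
```

On these data the sketch counts 18, 24 and 24 of the 25 errors within one, two and three standard errors respectively; Allamakee county alone lies beyond the widest range.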

A somewhat different interpretation of the standard error of
estimate, σ_X.A, may make its meaning and importance clearer.
If we are asked to estimate the average land value of one of our
25 counties knowing nothing of its average corn yield, we shall
have to be satisfied with the following answer; it is more likely
to be worth about $198.20 per acre (the mean value, M_X) than
any other amount, and the standard error of all such estimated
values is $61.23 per acre (the standard deviation, σ_X). If, however,
knowing that the r between land value and corn yield is .67,
we are given the additional information that the corn yield of a
particular county is 38 bu. per acre, then we are able to better
our estimate in two respects. We are able to say from the
regression formula that the land value is more likely to be around
$204 per acre than any other value, a much better estimate than
before; and we are also able to say that the standard error of
such estimates is now only $45.19 (σ_X.A), or only 73.8% of σ_X,
thus indicating greater reliability in the estimations.

The question arises: how much better is an r of .6 than one
of .4? What does the relative size of the r’s mean? To answer
this, we say that if r = .6, then

$$\sigma_{X.A} = \sigma_X \sqrt{1 - (.6)^2} = .8 \, \sigma_X \quad \text{or 80\% of } \sigma_X,$$

while if r = .4, then

$$\sigma_{X.A} = \sigma_X \sqrt{1 - (.4)^2} = .917 \, \sigma_X \quad \text{or 91.7\% of } \sigma_X.$$

Thus, an r of .6 reduces the standard deviation of estimated
values by 20%, whereas an r of .4 reduces it by only 8.3%. The
following table gives the percentages by which different r’s
reduce the standard deviations of estimated values:


TABLE 4. REDUCTION OF STANDARD DEVIATION


         Pct. reduction             Pct. reduction             Pct. reduction
  r      of standard         r      of standard         r      of standard
         deviation                  deviation                  deviation
 .05         .1%            .50        13.4%           .92        60.8%
 .10         .5%            .55        16.5%           .94        65.9%
 .15        1.1%            .60        20.0%           .95        68.8%
 .20        2.0%            .65        24.0%           .96        72.0%
 .25        3.2%            .70        28.6%           .97        75.7%
 .30        4.6%            .75        33.9%           .98        80.1%
 .35        6.3%            .80        40.0%           .99        85.9%
 .40        8.3%            .85        47.3%           .999       95.5%
 .45       10.7%            .90        56.4%          1.000      100.0%,
                                                                 or prediction
                                                                 perfect


From this table it is possible to say that for estimating
purposes an r of .8 reduces the standard deviation twice as much
as does an r of .6. Correlations of less than .4 are evidently
practically worthless for estimating purposes because they
reduce the standard deviation of estimated values by less than 9%.
This table also shows very strikingly why it is that correlation
coefficients cannot be interpreted as percentages.
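
The entries of such a table follow from the formula for the standard error of estimate: since that error is the standard deviation multiplied by the square root of (1 - r²), the percentage reduction is 100 × (1 - √(1 - r²)). The Python sketch below recomputes the entries on that assumption; the function name is illustrative.

```python
import math


def pct_reduction(r):
    """Percentage by which a correlation r reduces the standard deviation."""
    return 100 * (1 - math.sqrt(1 - r * r))


R_VALUES = (0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45,
            0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90,
            0.92, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 0.999, 1.0)

for r in R_VALUES:
    print(f"{r:6.3f}   {pct_reduction(r):6.1f}%")
```

The slow rise of the percentages at the lower end, and their rapid rise as r nears 1, is the reason a correlation coefficient cannot be read as a percentage.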

Beginners almost invariably attach more significance than they
should to low correlations and less than they should to higher
correlations. In working with multiple correlations they some- 
times find that the inclusion of two more variables will raise the 
correlation from .90 to .94. They hesitate to go to the labor of 
the extra calculation because they think that a gain of .04 is in- 
significant. This is a mistake. The table indicates that an r of 
.90 reduces the standard deviation by 56.4%, whereas an r of 
.94 reduces it by 65.9%. An additional reduction of 9.5% in 
the standard deviation of estimated values is tremendously worth 
while. In this sense, an r of .94 excels an r of .9 by more 
than an r of .4 excels one of .05. 

In concluding this part on simple correlation it should be
observed that all of the interpretations and conclusions are valid
only in so far as the band of dots in the scatter diagram
approximates rectilinearity. Any pronounced curve in this band
indicates a curvilinear regression line. The validity of the methods
herein described decreases as the curvilinearity of the regression 
line increases. This applies to the following pages also. It 
should be emphasized that scatter diagrams should always be 
plotted before proceeding with any calculations of correlation. 
Should there be a pronounced curvilinearity of regression it may 
or may not be practicable to divide the entire range into two or 
more sections and treat them separately. Of course, the results 
obtained will be valid only within the particular range or ranges 
treated. Another recourse is to fit a curve to the regression as 
an empirical formula. The resulting regression equation is used 
for estimation purposes exactly like those herein described. Some 
interesting work is being done on this problem by Dr. F. C.
Mills and Mr. Mordecai Ezekiel, their results appearing in some
of the current numbers of the Journal of the American Statistical
Association.


PART II. MULTIPLE CORRELATION—THREE 
VARIABLES. 


Whenever we dig thoroughly into any problem, we generally 
find it necessary to study a whole net-work of relationships. In 
the case of land values, we soon found other variables besides 
corn yield affecting land value. In the present chapter we shall 
consider the solution of the problem of one additional variable 
introduced—percentage of farm land in small grains. The values 
of this new variable, given in Table 6, will be designated by the 
letter B. 

First, of course, we shall have to calculate in the same way as
before the following constants for the new series of observed
values:

$$\Sigma B = 488, \quad M_B = 19.52\%, \quad \Sigma B^2 = 10{,}418, \quad \Sigma B^2 - (\Sigma B) M_B = 892,$$
$$\sqrt{\Sigma B^2 - (\Sigma B) M_B} = 29.87, \quad \sigma_B = 5.97\%.$$
Next, in order to find the new simple correlations required, we
shall need

$$\Sigma BX = 104{,}064, \qquad \Sigma AB = 18{,}519,$$
$$\Sigma BX - (\Sigma B) M_X = 7{,}342, \qquad \Sigma AB - (\Sigma A) M_B = 229.$$


Now, the two correlation coefficients are found exactly as in
Part I, a simple interchange of letters giving the new formulas.

In order to distinguish the three r’s, we shall add as subscripts
the letters corresponding to the variables correlated;
thus r_AX is the one already found, the correlation coefficient
between corn yield A and land value X. Denote by r_AB the
correlation coefficient between the percentage of land in small grain
B and corn yield A; and by r_BX the correlation coefficient
between percentage of land in small grain B and the land value X.
The order in which the subscripts are written has no significance;
that is, r_XA is the same coefficient as r_AX, etc. The new
calculations result in r_AB = .4146 and r_BX = .8029. These two new
r’s together with the one formerly found (r_AX = .6747) constitute
the necessary data for the calculation of the multiple correlation
coefficient, which is always denoted by R, and the multiple
regression equation.

From this point on, the method of procedure differs radically
from that of Part I. In the first place we shall have to introduce
two new quantities, the “partial regression coefficients”. Their
use will appear presently. They are denoted by the symbols
β_XA (β is the Greek letter for b, and the symbol is read “beta
X A”) and β_XB. The two β’s are found by solving a pair of
simultaneous equations known as “normal equations”. (See
Kelley: “Statistical Method”, p. 282.) In symbolical form these
two equations are written

$$\beta_{XA} + r_{AB} \, \beta_{XB} = r_{AX},$$
$$r_{AB} \, \beta_{XA} + \beta_{XB} = r_{BX}.$$

Using the data of our problem, the equations become,

$$1.0000 \, \beta_{XA} + .4146 \, \beta_{XB} = .6747 \qquad (1)$$
$$.4146 \, \beta_{XA} + 1.0000 \, \beta_{XB} = .8029 \qquad (2)$$


These equations may be solved for β_XA and β_XB by any of the
usual methods. For example, copy down the second equation,
and multiply the first by .4146, thus:

\[ .4146\,\beta_{XA} + 1.0000\,\beta_{XB} = .8029 \]
\[ .4146\,\beta_{XA} + .1719\,\beta_{XB} = .2797 \]

\[ \text{Subtracting:}\quad .8281\,\beta_{XB} = .5232 \]
\[ \text{Dividing by } .8281\text{:}\quad \beta_{XB} = .6318 \]

Substituting in the first equation:

\[ \beta_{XA} + .4146 \times .6318 = .6747 \]
\[ \text{Solving:}\quad \beta_{XA} = .4126 \]

The correctness of the solution may be tested by substitution
in the second equation.
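
The elimination just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the original procedure; the function name `solve_two_normal_equations` is hypothetical. Because the sketch carries no intermediate rounding, its results can differ from the printed figures in the fourth decimal place.

```python
def solve_two_normal_equations(r_ab, r_ax, r_bx):
    """Solve  b_xa + r_ab*b_xb = r_ax  and  r_ab*b_xa + b_xb = r_bx  by elimination."""
    # Multiply equation (1) by r_ab and subtract it from equation (2).
    coeff_b = 1.0 - r_ab * r_ab      # .8281 in the text
    rhs_b = r_bx - r_ab * r_ax       # .5232 in the text
    beta_xb = rhs_b / coeff_b
    # Substitute back into equation (1).
    beta_xa = r_ax - r_ab * beta_xb
    return beta_xa, beta_xb


beta_xa, beta_xb = solve_two_normal_equations(0.4146, 0.6747, 0.8029)

# Test of correctness: substitute in the second equation.
residual = 0.4146 * beta_xa + beta_xb - 0.8029
print(round(beta_xa, 4), round(beta_xb, 4), abs(residual) < 1e-9)
```

The final line repeats the check recommended in the text: both values are substituted in equation (2), and the remainder should vanish.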


We now calculate the “multiple correlation coefficient”, denoted
by the symbol R, by means of the formula

\[ R^2 = \beta_{XA}\,r_{AX} + \beta_{XB}\,r_{BX}. \]
\[ \text{Substituting:}\quad R^2 = .4126 \times .6747 + .6318 \times .8029 = .7856 \]
\[ \text{Therefore,}\quad R = .89 \]


Parenthetically, it may be observed, that while the subscripts
of the r’s may be interchanged, those of the β’s may not. Thus
β_AX does not denote the same number as β_XA. The meaning of
β_AX will be explained later.

A second parenthetic observation will be of interest to many
readers. These β’s are the same as the “path coefficients” used
by Sewall Wright in “Correlation and Causation” (Jour. Ag.
Res. Vol. XX, No. 7, pp. 557-575), and the products, β_XA r_AX and
β_XB r_BX, are his “coefficients of determination”.

Before discussing the meaning of R, we shall complete the calculations
by computing the constants of the new regression equation
from the formula:

\[ \bar{X} = M_X + \beta_{XA}\,\frac{\sigma_X}{\sigma_A}\,(A - M_A) + \beta_{XB}\,\frac{\sigma_X}{\sigma_B}\,(B - M_B). \]

The use of the β’s is now obvious. They play the same role in
the multiple regression equation as the r’s do in simple regression
equations. Substituting our values for the above symbols,
we find the new estimated value of X to be,

\[ \bar{X} = 198.20 + .4126 \times \frac{61.23}{3.70}\,(A - 37.48) + .6318 \times \frac{61.23}{5.97}\,(B - 19.52) \]

or,

\[ \bar{X} = 6.828A + 6.478B - 184.16 \]
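
As a rough illustration of how the β’s are turned into the constants of the regression equation, the following Python sketch multiplies each β by the ratio of standard deviations and collects the constant term. The function names `raw_regression` and `estimate` are hypothetical; all numbers are those of the text, and the Allamakee figures (A = 40, B = 11) are the ones used in the worked example. Small differences from the printed constants in the last figure are to be expected, since nothing is rounded along the way.

```python
def raw_regression(betas, sigma_x, sigmas, mean_x, means):
    """Convert partial regression coefficients to an equation in original units."""
    slopes = [b * sigma_x / s for b, s in zip(betas, sigmas)]
    intercept = mean_x - sum(k * m for k, m in zip(slopes, means))
    return slopes, intercept


slopes, intercept = raw_regression(
    betas=[0.4126, 0.6318],   # beta_XA, beta_XB from the text
    sigma_x=61.23,
    sigmas=[3.70, 5.97],      # sigma_A, sigma_B
    mean_x=198.20,
    means=[37.48, 19.52],     # M_A, M_B
)


def estimate(values):
    """Estimated land value for one county from its (A, B) values."""
    return intercept + sum(k * v for k, v in zip(slopes, values))


allamakee = estimate([40, 11])   # A = 40 bu. per acre, B = 11% small grain
print([round(k, 3) for k in slopes], round(intercept, 2), round(allamakee, 2))
```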


As the final step in the calculations, we must now find the estimated
values of X corresponding to each pair of actual values
of A and B. For example, in Allamakee county, A = 40 bu. per
acre corn yield, and B = 11% of farm land in small grain. Substituting
in the regression equation, we find the corresponding
estimated land value to be,

\[ \bar{X} = 6.828 \times 40 + 6.478 \times 11 - 184.16 = \$160.22 \text{ per acre.} \]

Continuing this process for each county, we get Table 5.

TABLE 5. LAND VALUE ESTIMATED FROM CORN YIELD AND
FARM LAND IN SMALL GRAIN

                      Actual average    Estimated
                      land value        land value     Error of
Observation           per acre          per acre       estimate

 1. Allamakee            $ 87             $160            −73
 2. Bremer                133              146            −13
 3. Butler                174              171              3
 4. Calhoun               285              310            −25
 5. Carroll               263              244             19
 6. Cherokee              274              252             22
 7. Dallas                235              231              4
 8. Davis                 104               86             18
 9. Fayette               141              146            − 5
10. Fremont               208              158             50
11. Howard                115              137            −22
12. Ida                   271              238             33
13. Jefferson             163              159              4
14. Johnson               193              180             13
15. Kossuth               203              231            −28
16. Lyon                  279              276              3
17. Madison               179              152             27
18. Marshall              244              246            − 2
19. Monona                165              178            −13
20. Pocahontas            257              283            −26
21. Polk                  252              238             14
22. Story                 280              239             41
23. Wapello               167              158              9
24. Warren                168              158             10
25. Winneshiek            115              178            −63

Before discussing the reasons for the more glaring errors of
estimate, we shall return to the subject of the meaning and use
of the multiple correlation coefficient, R. In the first place, R
is the simple correlation coefficient between actual land values
and land values estimated from the regression equation. In other
words, it is precisely the same kind of a measure of our success
in estimating or predicting with two independent variables as
r was a measure of our success with one independent variable
(see page 15). In the second place, just as in simple correlation,
R enables us to compute readily the standard error of estimate
which, with two independent variables, A and B, is denoted
by the symbol σ_X.AB.

The formula is

\[ \sigma_{X.AB} = \sigma_X \sqrt{1 - R^2}, \]

or in our example,

\[ \sigma_{X.AB} = 61.23 \sqrt{1 - .7856} = 61.23 \times .463 = \$28.35 \text{ per acre.} \]

That is, our value R = .89 enables us to reduce the standard deviation
of estimated values to 46.3% of the standard deviation of
the X’s; or in other words, to reduce it by 53.7%. (Compare
Table 4). We are now likely to find more than 95% (only 90%
in this particular example) of our errors of estimate lying within
the comparatively small range from −$56.70 to +$56.70
(± 2σ_X.AB). In the third place, the standard deviation of R is
(as in the case of r)

\[ \sigma_R = \frac{1 - R^2}{\sqrt{n}} = \frac{1 - .7856}{5} = .04 \]

which means that similarly derived R’s are almost certain to lie
above .80, thus practically always reducing the standard deviation
of estimated values more than 40% (see Table 4).
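
The three uses of R described above reduce to a few lines of arithmetic. The Python sketch below is illustrative only; the variable names are hypothetical, and every input is a figure quoted in the text.

```python
import math

r_ax, r_bx = 0.6747, 0.8029          # simple correlations from the text
beta_xa, beta_xb = 0.4126, 0.6318    # partial regression coefficients from the text
sigma_x, n = 61.23, 25               # standard deviation of X, number of counties

r_squared = beta_xa * r_ax + beta_xb * r_bx
multiple_r = math.sqrt(r_squared)
std_error_estimate = sigma_x * math.sqrt(1.0 - r_squared)   # sigma_X.AB
sigma_r = (1.0 - r_squared) / math.sqrt(n)                  # standard deviation of R

print(round(r_squared, 4), round(multiple_r, 2))
print(round(std_error_estimate, 2), round(2 * std_error_estimate, 2), round(sigma_r, 2))
```

Twice the standard error of estimate gives the half-width of the range within which most errors of estimate are expected to fall.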

Returning to a consideration of Table 5 we find that the new 
estimate of land value in Lyon county is very close to the actual 
value, whereas the estimate for Kossuth county is not so good as 
before. Kossuth, having about average corn yield and average 
land value, is estimated quite closely from corn yield alone; but 
since it is close to the highest county in percentage of land in 
small grain, the inclusion of the latter variable raises the land 
value estimate too much. Other factors will have to be introduced
to counterbalance this effect. The estimates of land value in Allamakee
and Winneshiek counties, especially that for Allamakee,
are better than when only one independent variable was used.
It is still necessary to take into account the fact that these coun- 
ties have low percentages of their farm land in corn. Fremont 
county is still much above its estimated value; in fact, we have 
made a poorer estimate with two independent variables than with 
one. This is because of Fremont’s unusually large percentage of 
farm land in corn, a characteristic which we shall certainly have 
to take into account before our problem is completed. In six 
other counties besides Fremont, our newly estimated values are 
not so close to actual values as was the case when corn yield 
alone was considered. In the other eighteen counties, our esti- 
mations are closer to the facts. 


PART III. MULTIPLE CORRELATION: MORE THAN
THREE VARIABLES.


It is quite evident from what precedes that, while we have 
made progress in our attempt to analyze the relations between 
land value and associated variables, we are still far from a sat- 
isfactory knowledge of these relations. We shall complete our 
illustrative example by including three more variables, as fol- 
lows: average number of improved acres per farm, C; number 
of brood sows per 1,000 acres, D; and percentage of farm land in 
corn, E. Table 6 gives the complete data for 25 Iowa counties. 

TABLE 6. DATA FROM 25 IOWA COUNTIES

A = corn yield, bushels per acre; B = percentage of farm land in small
grain; C = average number of improved acres per farm; D = number of
brood sows per 1,000 acres; E = percentage of farm land in corn;
X = average land value per acre; S = sum of the six preceding columns.

No.  County          A     B     C     D     E      X      S

 1   Allamakee      40    11   103    42    14   $ 87    297
 2   Bremer         36    13   102    58    30    133    372
 3   Butler         34    19   137    53    30    174    447
 4   Calhoun        41    33   160    49    39    285    607
 5   Carroll        39    25   157    74    33    263    591
 6   Cherokee       42    23   166    85    34    274    624
 7   Dallas         40    22   130    52    37    235    516
 8   Davis          31     9   119    20    20    104    303
 9   Fayette        36    13   106    53    27    141    376
10   Fremont        34    17   137    59    40    208    495
11   Howard         30    18   136    40    19    115    358
12   Ida            40    23   185    95    31    271    645
13   Jefferson      37    14    98    41    25    163    378
14   Johnson        41    13   122    80    28    193    477
15   Kossuth        38    24   173    52    31    203    521
16   Lyon           38    31   182    71    35    279    636
17   Madison        34    16   124    43    26    179    422
18   Marshall       45    19   138    60    34    244    540
19   Monona         34    20   148    52    30    165    449
20   Pocahontas     40    30   164    49    38    257    578
21   Polk           41    22    96    39    35    252    485
22   Story          42    21   132    54    41    280    570
23   Wapello        35    16    96    41    23    167    378
24   Warren         33    18   118    38    24    168    399
25   Winneshiek     36    18   113    61    21    115    364

The principles involved in handling more than three variables
are identical with those explained in the three variable problem.
First, calculate the r’s, then the β’s, and from these, R and the
regression equation. However, with more than three variables,
it is desirable to adopt some labor saving, systematizing and accuracy
promoting devices, and these will now be explained.

The first of these devices is the introduction of an extra vari- 
able, S, whose values are shown in the last column of Table 6. 
Each number in this column is merely the sum (hence, S) of the 
corresponding numbers of the other columns; thus, for Alla- 
makee county, 


S = 40 + 11 + 103 + 42 + 14 + 87 = 297


The sums, S, are handled exactly like values of a seventh variable. 
The relatively small amount of extra labor involved in handling 
S furnishes a perfect check on the accuracy of the calculations, 
and obviates the necessity of repeating them. The details of the 
use of S will be given in the proper places below. 
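
Why the sum column serves as a check may be seen from a small sketch. The Python below uses only the first three counties of Table 6 and is an illustration, not the full computation; the helper name `product_moment` is hypothetical. The product moments of A with each of the six variables must add up to the product moment of A with S.

```python
# First three counties of Table 6: (A, B, C, D, E, X)
rows = [
    (40, 11, 103, 42, 14, 87),    # Allamakee
    (36, 13, 102, 58, 30, 133),   # Bremer
    (34, 19, 137, 53, 30, 174),   # Butler
]
s_column = [sum(row) for row in rows]   # the extra variable S


def product_moment(u, v):
    """Sum of products of paired values, as accumulated on the machine."""
    return sum(x * y for x, y in zip(u, v))


columns = list(zip(*rows))              # one tuple per variable
a_column = columns[0]
# Product moments of A with every variable (including A itself) ...
moments_with_a = [product_moment(a_column, col) for col in columns]
# ... must add up to the product moment of A with S.
check_passes = sum(moments_with_a) == product_moment(a_column, s_column)
print(s_column, check_passes)
```

An error in any one product moment would break the equality, which is what makes the extra column worth its labor.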


THE SIMPLE CORRELATION COEFFICIENTS 


The second of the new devices is merely a form (Tables 7a 
and 7b) for systematizing the method of calculating the r’s 
and the σ’s. (See mimeograph bulletin by Bradford B. Smith,
“The Use of Punched Card Tabulating Equipment in Multiple
Correlation Problems”, Bureau of Agricultural Economics,
Washington, D. C., October, 1923.) In Table 7a the entries are 
indicated by symbols alone, just as in a formula. In Table 7b 
are shown the corresponding numbers in our problem. By com- 
paring these tables with the formulas and numbers previously 
used the familiar parts will be quickly located, and the whole 
scheme easily comprehended. The blank spaces in the lower left 
portion of the tables are caused by the elimination of unneces- 
sary repetition. For example, line B, column A would naturally 
contain SBA, but this is identical with SAB (18,519) and is 
therefore omitted. 

In Table 7b, the sums in the first line are recorded directly
from the calculating machine, all original data being found in
Table 6. The correctness of these sums is checked in this line
by observing that the sum of the first six of them is equal to the
seventh; that is,

937 + 488 + 3,342 + 1,361 + 745 + 4,955 = 11,828

The “product moments” (including the sums of the squares)
in lines A₁, B₁, etc., are recorded directly from the machine.
The calculation may be facilitated by folding Table 6 vertically
so as to bring into juxtaposition the pair of numbers to be multiplied.

The check in line A₁ is furnished by adding the first six numbers
in that line. The sum should be the same as the product
moment (SAS) already recorded under S in the same line; that
is,

35,461 + 18,519 + etc. = 449,422

To check line B₁ it must be remembered that the first of the
product moments (SBA = SAB = 18,519) is omitted. It is necessary,
therefore, to start at the top of column B, come down to
line B₁, then go across the line; thus,

18,519 + 10,418 + 68,242 + etc. = 243,882

Similarly, start at the top of column C, go down to line C₁, then
across, obtaining

125,886 + 68,242 + 464,684 + 188,152 +
101,900 + 688,739 = 1,637,603

The products in lines A₂, B₂, etc., are also recorded directly
from the machine, the data being found in the first and second
lines of the present table. The check is the same as before, except
that all the numbers checked are now found in lines with
subscripts 2. For line A₂ we have,

35,119 + 18,290 + 125,258 + etc. = 443,313

For line E₂ (and column E),

27,923 + 14,542 + 99,592 + 40,558 +
22,201 + 147,659 = 352,474

Each number in line A₂ is subtracted from the number just
above it in line A₁, the results appearing in line A₃. The same
relations obtain in lower parts of the tables. It happens in this
problem that all these differences are positive; but in another
problem the lower number might in some places be larger, in
which case the difference would be negative. This would result
in a negative correlation coefficient. The check in lines with
subscripts 3 is the same as before, and is the final check on this
part of the calculation. With skillful calculators, either or both
the preceding checks may be omitted, but this last one is essential.

The number of significant figures to which the results check 
depends, of course, upon the number of figures carried in the 
means. In the illustrative problem, since the number of obser- 
vations is 25, the means are made arithmetically exact by carry- 
ing only two decimal places. In another problem, however, if it 
is desired to check results to seven significant figures (as is done 
in this problem) seven figures would have to be carried in the 
means and even then the last figures would not usually check, as 
is the case in the last two numbers of the illustrative problem. 
Of course, the extra figures have no statistical significance, and 
their use would in any case be merely a matter of office practice. 
So far as statistical significance is concerned, all the numbers 
used in this problem might have been limited to the first three, 
or at most four figures. 

The first number in line A₄ (18.493) is the square root of the
number just above it (342); and similarly, for the first numbers
in lines B₄, etc. Each of these square roots when divided by the
square root of the number of observations (in our problem 25)
gives the corresponding standard deviation σ in the bottom row
of the table.

The remaining numbers in lines A₄, B₄, etc., are products of two
square roots, namely the first square root in the same line by the
last square root in the same column. For example, the number
(1,036.3) in line B₄ column E is the product of 29.866 (line
B₄ column B) by 34.699 (line E₄ column E).

The correlation coefficients of A with each of the remaining
variables (not including S) are calculated by dividing each
number in line A₃ (not including column A) by the number just
below it. As an example, from column E,

\[ r_{AE} = \frac{338}{641.7} = .5267 \]

As an example of the similar use of later rows, consider rows
C₃ and C₄, column X; from these we obtain,

\[ r_{CX} = \frac{26{,}355}{40{,}989} = .6430 \]
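
The chain of operations in Table 7b (product moment, correction for the means, square roots, division) can be sketched directly from the Table 6 columns. The Python below is an illustration under the assumption that the data are as printed in Table 6; the function names are hypothetical. Because the sketch keeps all decimals, whereas the hand computation rounds the corrected sums before taking square roots, the r’s it gives may differ from the printed ones in the third or fourth decimal place.

```python
import math

# Columns of Table 6, one value per county, in table order.
A = [40, 36, 34, 41, 39, 42, 40, 31, 36, 34, 30, 40, 37,
     41, 38, 38, 34, 45, 34, 40, 41, 42, 35, 33, 36]
B = [11, 13, 19, 33, 25, 23, 22, 9, 13, 17, 18, 23, 14,
     13, 24, 31, 16, 19, 20, 30, 22, 21, 16, 18, 18]
C = [103, 102, 137, 160, 157, 166, 130, 119, 106, 137, 136, 185, 98,
     122, 173, 182, 124, 138, 148, 164, 96, 132, 96, 118, 113]
E = [14, 30, 30, 39, 33, 34, 37, 20, 27, 40, 19, 31, 25,
     28, 31, 35, 26, 34, 30, 38, 35, 41, 23, 24, 21]
X = [87, 133, 174, 285, 263, 274, 235, 104, 141, 208, 115, 271, 163,
     193, 203, 279, 179, 244, 165, 257, 252, 280, 167, 168, 115]


def corrected_product_moment(u, v):
    """Sum(uv) minus Sum(u) times mean(v): the entry in a line with subscript 3."""
    return sum(p * q for p, q in zip(u, v)) - sum(u) * sum(v) / len(u)


def correlation(u, v):
    """Line-3 entry divided by the product of the two square roots (line 4)."""
    root_u = math.sqrt(corrected_product_moment(u, u))
    root_v = math.sqrt(corrected_product_moment(v, v))
    return corrected_product_moment(u, v) / (root_u * root_v)


sigma_a = math.sqrt(corrected_product_moment(A, A) / len(A))
r_ab, r_ae, r_cx = correlation(A, B), correlation(A, E), correlation(C, X)
print(round(sigma_a, 2), round(r_ab, 4), round(r_ae, 4), round(r_cx, 4))
```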


THE NORMAL EQUATIONS 


In order to record the r’s and use them for calculating the
β’s in the simplest way, we now turn to a consideration of Tables
8a and 8b. Here, as its value is calculated, each r is recorded
in the row and column corresponding to its subscripts. The exact
position of each r is clearly indicated in Table 8a. Observe
that r_AA = 1.0000, r_BB = 1.0000, etc. These tables exhibit the
third and last of the new devices to be considered in this part.
This device is a short scheme for obtaining the solution of the normal
equations, a set of simultaneous, linear equations having the
same number of “unknowns” (five in our illustrative problem)
as there are independent variables. The unknowns in these normal
equations are the partial regression coefficients, β_XA, β_XB,
β_XC, etc.

Written out in full, these five normal equations appear thus:

\[
\begin{aligned}
\beta_{XA} + r_{AB}\beta_{XB} + r_{AC}\beta_{XC} + r_{AD}\beta_{XD} + r_{AE}\beta_{XE} &= r_{AX} \\
r_{BA}\beta_{XA} + \beta_{XB} + r_{BC}\beta_{XC} + r_{BD}\beta_{XD} + r_{BE}\beta_{XE} &= r_{BX} \\
r_{CA}\beta_{XA} + r_{CB}\beta_{XB} + \beta_{XC} + r_{CD}\beta_{XD} + r_{CE}\beta_{XE} &= r_{CX} \\
r_{DA}\beta_{XA} + r_{DB}\beta_{XB} + r_{DC}\beta_{XC} + \beta_{XD} + r_{DE}\beta_{XE} &= r_{DX} \\
r_{EA}\beta_{XA} + r_{EB}\beta_{XB} + r_{EC}\beta_{XC} + r_{ED}\beta_{XD} + \beta_{XE} &= r_{EX}
\end{aligned}
\]

(See Kelley: “Statistical Method”, p. 296.)

It will be observed that there is a diagonal row of β’s through
this array of equations, from the upper left to the lower right
corner, each of whose coefficients is unity. If, now, we remember
that r_AB = r_BA, r_AC = r_CA, r_CE = r_EC, etc., it can be seen that
the r’s in the upper right hand part of the array and the equal
r’s in the lower left hand part are arranged symmetrically with
respect to the diagonal of unity coefficients. It is for this reason
that short methods of solution can be used, and that we need
keep only that portion of the equations above and to the right of
the diagonal, together with the “1’s” in the diagonal itself. Finally,
it is unnecessary to record the β’s, since only the r’s are
required for calculation. Thus we get the arrangement of the
r’s in Tables 8a and 8b. The directions to be given for manipulations
in these tables have as their objective the solution of the
normal equations, giving finally the values of the β’s. For an
extensive explanation of the whole process, see Wright and Hayford:
“Adjustment of Observations”, pp. 114-120.

Most of the details of manipulation can be understood by study
of the directions given in the tables themselves, and comparison
of the two tables. Each symbol in Table 8a stands for whatever
number might be entered in the corresponding cell in any
particular problem. Thus, in our problem, r_AC (line 1, column
C) stands for the number .2536; [bb] (line 5, column B)
stands for .8281; [dx] (line 17, column X) stands for −.1199;
and −[dx] (line 2 of reverse, column X) stands for .1199.
Some statements of general principles will help the operator to
carry the details in mind.

First. Each block of lines is narrower by one column than 
the preceding block, and after the B-block each block of lines 
is one line wider than the preceding block. If there were an F 
variable, the F-block would contain 8 lines (25 to 32 inclusive) 
and so on for any number of variables. 

Second. Beginning with the B-block, the next to the last line
in each block (lines 5, 10, 16, etc.) consists of the algebraic sums
of all the entries above it in the same block. Thus in Table 8b
the number (.8281) in line 5, column B, is equal to 1 − .1719,
and the number (.0402) in line 16, column E, is equal to

.3824 − .2631 − .0741 − .0050

Third. The sums in the next-to-the-last line of each block
(lines 5, 10, 16, 23) are each to be divided by the first such sum
in the same block, the signs reversed, and the quotients entered
just below the dividends. Thus, in line 10 the divisor is .4307;
the dividends are .4307, .3385, .0063, .0633 and .8388; and the
quotients with signs changed appearing in line 11 are −1.0000,
−.7859, −.0146, −.1470 and −1.9475.

Fourth. Each of the remaining lines in any block consists
of products calculated from one of the preceding blocks. Thus,
in block D the products in line 13 are calculated from the A-block,
those in line 14 from the B-block and those in line 15 from
the C-block. To illustrate from Table 8b, consider the products
in line 14, block D. These products come from the B-block,
thus:

                 Line       D          E          X          S
Multiplicands      5      .1343      .4571      .5232     2.5894
Multiplier         6     −.1622
Products          14     −.0218     −.0741     −.0849     −.4200

Fifth. The last line in each block contains the coefficients with
signs reversed of an equation from which some of the unknown
β’s have been eliminated. Thus, from line 17 we may infer that

\[ 1.0000\,\beta_{XD} + .0869\,\beta_{XE} = .1199 \]

Similarly from line 24,

\[ \beta_{XE} = .4547 \]

which is, therefore, the first one of the β’s whose value is found
after all the rest of them have been eliminated from the equations.

Sixth. The S-column furnishes a check on the accuracy of the
work in each block, but does not check the calculations of the
r’s. The entries in the S-column are not carried over from Table
7. That in line 1 is simply the sum of the r’s to the left of it in
the same line; that is,

1.0000 + .4146 + .2536 + .4995 + .5267 + .6747 = 3.3691

The entry in line 3, column S is likewise the sum of five r’s arranged
in the same “down and across” manner as used in Table
7b; thus, going down column B and across line 3, we have

.4146 + 1.0000 + .7518 + .3414 + .6755 + .8029 = 3.9862

As a final illustration, consider the entry in line 12, column S.
Down column D and across line 12,

.4995 + .3414 + .5701 + 1.0000 + .3824 + .5271 = 3.3205

After the entries are made in column S, they are treated exactly
like the original entries in the other columns. (See Table
8a, column S.) The check is furnished in the last line of each
block. The number in the S column of that line should be (approximately)
equal to the sum of the numbers to the left of it in
the same line (not down and across). Consider, for example,
line 11,

−1.0000 − .7859 − .0146 − .1470 = −1.9475


Seventh. The “Reverse” (bottom five lines of Tables 8a and
8b) is the process of finding the values of the preceding β’s by
retracing our steps, equation by equation. Some of the details
will now have to be explained, as follows:

(1) In column X, copy in reverse order with sign changed the
last number (in the same column) in each block above. This is
clearly indicated in Table 8a, column X of the reverse.

(2) In line 1, column E, copy the value of β_XE, which in our
problem is .4547. The two numbers in reverse, line 1, are always
the same.

(3) In column E, below β_XE, enter in reverse order the products
of β_XE by the last number appearing in that column in each
of the blocks above the E-block. For example, −.0395 = .4547
× −.0869, and −.0066 = .4547 × −.0146.

(4) In reverse line 2, add (algebraically) the numbers in
columns X and E (.1199 − .0395), placing the sum (.0804) in
the same line, column D. This sum is the value of β_XD. It may
easily be seen that the operations in reverse line 2 result in the
substitution of the value of β_XE in an equation mentioned above,
namely

\[ 1.0000\,\beta_{XD} + .0869\,\beta_{XE} = .1199 \]

and also in its solution for the value of β_XD;

\[ \beta_{XD} = .1199 - (.0869 \times .4547) = .1199 - .0395 = .0804 \]

(5) Repeat in column D the operations just described, using
as multiplier the value of β_XD (.0804). We now compute in
reverse line 3,

\[ \beta_{XC} = .1470 - .0066 - .0632 = .0772 \]

What we have really done in reverse line 3 is to substitute the
values of β_XE and β_XD in an equation inferred from line 11 above,
as follows:

\[ 1.0000\,\beta_{XC} + .7859\,\beta_{XD} + .0146\,\beta_{XE} = .1470 \]

and solve the resulting equation for β_XC.

Continue this reverse process until all the β’s have been calculated,
then verify the results by substituting their values in
some one of the original normal equations. For example, reading
down column D to line 12, then along line 12, we infer the
equation,

\[ .4995\,\beta_{XA} + .3414\,\beta_{XB} + .5701\,\beta_{XC} + 1.0000\,\beta_{XD} + .3824\,\beta_{XE} = .5271 \]

Substituting the values of the β’s in the left member of this
equation it becomes,

\[ (.4995 \times .2479) + (.3414 \times .3075) + (.5701 \times .0772) + (.0804) + (.3824 \times .4547) \]

Without clearing the machine, we compute the sum of these
products as .5271, thus verifying the correctness of the β’s in
the equation above. For verification purposes, any of the original
normal equations may be used except the first (line 1),
which has already been made use of (reverse, line 5) for calculating
the value of β_XA.
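
The forward elimination, the check column and the reverse can be imitated with ordinary Gaussian elimination. The Python sketch below is not a line-by-line reproduction of Tables 8a and 8b (it divides each pivot line by its leading entry instead of recording negative quotients), and the function name `solve_normal_equations` is hypothetical. The r’s are those quoted in the text, with one exception: r_CE is not quoted in this section, and the value .4969 used here is an assumption computed from the Table 6 data.

```python
# Correlations among the independent variables A, B, C, D, E.
# r_CE (.4969) is an assumed value computed from Table 6; the rest are from the text.
names = ["A", "B", "C", "D", "E"]
r = [
    [1.0000, 0.4146, 0.2536, 0.4995, 0.5267],
    [0.4146, 1.0000, 0.7518, 0.3414, 0.6755],
    [0.2536, 0.7518, 1.0000, 0.5701, 0.4969],
    [0.4995, 0.3414, 0.5701, 1.0000, 0.3824],
    [0.5267, 0.6755, 0.4969, 0.3824, 1.0000],
]
r_x = [0.6747, 0.8029, 0.6430, 0.5271, 0.8621]   # r_AX ... r_EX


def solve_normal_equations(r, r_x):
    n = len(r)
    # Each line: coefficients, right-hand member, and a check entry S (the line sum).
    rows = [list(r[i]) + [r_x[i]] for i in range(n)]
    rows = [row + [sum(row)] for row in rows]
    # Forward elimination: one unknown is removed per block.
    for k in range(n):
        pivot = rows[k][k]
        rows[k] = [v / pivot for v in rows[k]]
        for i in range(k + 1, n):
            factor = rows[i][k]
            rows[i] = [v - factor * p for v, p in zip(rows[i], rows[k])]
        # Check: S must still equal the sum of the other entries in its line.
        for row in rows:
            assert abs(sum(row[:-1]) - row[-1]) < 1e-9
    # The "reverse": substitute back from the last unknown to the first.
    betas = [0.0] * n
    for k in reversed(range(n)):
        betas[k] = rows[k][n] - sum(rows[k][j] * betas[j] for j in range(k + 1, n))
    return betas


betas = solve_normal_equations(r, r_x)
for name, beta in zip(names, betas):
    print(f"beta_X{name} = {beta:.4f}")

# Verification by substitution in the normal equation read from column D, line 12.
line_12 = sum(coef * beta for coef, beta in zip(r[3], betas))
```

The check column behaves here as in the tables: it is carried through every operation like any other column, and a disagreement between it and the line sum would expose a slip in that block.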


THE MULTIPLE CORRELATION COEFFICIENT 


We are now ready to calculate the multiple correlation coefficient,
R, from the equation

\[ R^2 = \beta_{XA}\,r_{AX} + \beta_{XB}\,r_{BX} + \text{etc.} \]
\[ = (.2479 \times .6747) + (.3075 \times .8029) + (.0772 \times .6430) + (.0804 \times .5271) + (.4547 \times .8621) \]

The machine gives directly the sum of these products as .8982.
The factors are readily found in Table 8b; and after a little
practice the multiplications and additions may be carried
through on the machine without making such a list as that given
in the equation above, without clearing the machine. Finally,

\[ R = \sqrt{.8982} = .95 \]


THE STANDARD ERROR OF ESTIMATE


This value of R shows that if we attempt to estimate land
values from these five independent variables, the standard error
of estimate will be

\[ \sigma_{X.ABCDE} = \sigma_X \sqrt{1 - R^2} = .319\,\sigma_X \text{ or } 31.9\% \text{ of } \sigma_X. \]

That is,

\[ \sigma_{X.ABCDE} = .319 \times \$61.23 = \$19.53 \]


Thus, we have reduced the original standard deviation by 68.1%. 
In Part II, we found that by using two independent variables 
we could reduce the original standard deviation by only 53.7%. 
The addition of three more independent variables is therefore of 
real value. On the other hand, the fact that the standard error 
of estimate is still 31.9% of σ_X shows that the problem is not
completely solved. There are other influences on the price of 
land which have not been considered, and it is the search for 
these that will engage the interest of the student of economics. 
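
The accumulation of products that gives R², and the standard error of estimate that follows from it, may be sketched as below. This Python fragment is illustrative only; the variable names are hypothetical and the inputs are the figures quoted in the text.

```python
import math

betas = [0.2479, 0.3075, 0.0772, 0.0804, 0.4547]      # beta_XA ... beta_XE
r_with_x = [0.6747, 0.8029, 0.6430, 0.5271, 0.8621]   # r_AX ... r_EX
sigma_x = 61.23

# Sum of products, accumulated as on the machine without clearing.
r_squared = sum(b * r for b, r in zip(betas, r_with_x))
multiple_r = math.sqrt(r_squared)
fraction_remaining = math.sqrt(1.0 - r_squared)   # share of sigma_X that remains
std_error = fraction_remaining * sigma_x          # sigma_X.ABCDE

print(round(r_squared, 4), round(multiple_r, 2),
      round(fraction_remaining, 3), round(std_error, 2))
```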


THE REGRESSION EQUATION


The regression equation with five independent variables is

\[ \bar{X} = M_X + \beta_{XA}\,\frac{\sigma_X}{\sigma_A}\,(A - M_A) + \beta_{XB}\,\frac{\sigma_X}{\sigma_B}\,(B - M_B) + \text{etc.} \]

With our data, this becomes

\[
\begin{aligned}
\bar{X} = 198.20 &+ .2479 \times \frac{61.23}{3.70}\,(A - 37.48) + .3075 \times \frac{61.23}{5.97}\,(B - 19.52) \\
&+ .0772 \times \frac{61.23}{26.78}\,(C - 133.68) + .0804 \times \frac{61.23}{16.28}\,(D - 54.44) \\
&+ .4547 \times \frac{61.23}{6.94}\,(E - 29.80)
\end{aligned}
\]

or

\[ \bar{X} = 4.103A + 3.154B + .1766C + .3022D + 4.012E - 176.76 \]

Using this equation, we calculate the land values shown in Table 9.


TABLE 9. LAND VALUE ESTIMATED FROM FIVE VARIABLES

                  Average        Estimated
                  land value     land value     Error of
County            per acre       per acre       estimate

Allamakee           $ 87           $109           −22
Bremer               133            168           −35
Butler               174            183           − 9
Calhoun              285            295           −10
Carroll              263            245            18
Cherokee             274            260            14
Dallas               235            244           − 9
Davis                104             86            18
Fayette              141            155           −14
Fremont              208            219           −11
Howard               115            115             0
Ida                  271            246            25
Jefferson            163            149            14
Johnson              193            191             2
Kossuth              203            225           −22
Lyon                 279            271             8
Madison              179            152            27
Marshall             244            247           − 3
Monona               165            188           −23
Pocahontas           257            278           −21
Polk                 252            230            22
Story                280            266            14
Wapello              167            139            28
Warren               168            144            24
Winneshiek           115            150           −35


It should be observed that in this problem where all the β’s
are positive, the land value of any one county is found by adding
five products and subtracting $176.76. This should be done, as
usual, without clearing the machine. In this way, the estimated
values for all the counties can be found in a short time. If part
of the β’s were negative, their terms should be subtracted instead
of added. This is done in the crank driven machine by
turning the crank backward (subtracting) instead of forward.
In the key driven machine, the “complementary” number must
be used.
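
The passage from the β’s to the working equation, and the accumulation of five products for one county, may be sketched as follows. The Python is an illustration only; `estimated_land_value` is a hypothetical name, the standard deviations and means are those appearing in the regression equation, and the Allamakee row is taken from Table 6. Constants computed without rounding can differ slightly in the last figure from those printed.

```python
betas = [0.2479, 0.3075, 0.0772, 0.0804, 0.4547]   # beta_XA ... beta_XE
sigmas = [3.70, 5.97, 26.78, 16.28, 6.94]          # sigma_A ... sigma_E
means = [37.48, 19.52, 133.68, 54.44, 29.80]       # M_A ... M_E
sigma_x, mean_x = 61.23, 198.20

# Coefficient of each variable in original units: beta * sigma_X / sigma.
coefficients = [b * sigma_x / s for b, s in zip(betas, sigmas)]
# Constant term: M_X minus each coefficient times the mean of its variable.
constant = mean_x - sum(c * m for c, m in zip(coefficients, means))


def estimated_land_value(a, b, c, d, e):
    """Accumulate the five products, then add the (negative) constant."""
    return sum(k * v for k, v in zip(coefficients, (a, b, c, d, e))) + constant


allamakee = estimated_land_value(40, 11, 103, 42, 14)   # first row of Table 6
print([round(k, 4) for k in coefficients], round(constant, 2), round(allamakee))
```

A negative β would simply make its coefficient negative, so that the corresponding product is subtracted in the sum, which is the machine operation described above.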

If we compare our latest errors of estimate with those made 
on the basis of two independent variables, we find notable im- 
provement in the cases of Allamakee, Calhoun, Fremont, How- 
ard, Story and Winneshiek counties, but much poorer estimates 
for Bremer and Wapello. Land values in nine more of the 
twenty-five counties are not estimated so well with these five 
independent variables as with two, but the changes are relatively
insignificant. This strengthens our previous conclusion
that the student of economics has still a long way to go before 
he finds all the factors that are highly associated with land 
values. 

SCORING 

The regression equation is the best scoring device available, 
the ‘‘score’’ of any individual being the value of the criterion, 
X, calculated from a given set of values of the independent 
variables. In this sense, the estimated land values found in the 
second column of Table 9 constitute the scores of the correspond- 
ing counties on the basis of land value. 

There are times, however, when a simpler scoring device is desirable.
While there is no general agreement on the subject, the
partial regression coefficients probably constitute the simplest
and most straightforward data for making a score card. In the
table below is entered the value of each β in our land value
problem, and just beneath it is placed a rate percent. Each
rate percent is found by dividing the corresponding β by the
sum (1.1677) of the five β’s.


SCORE CARD

                  Corn      Small     Improved    Brood     Corn
                  Yield     Grain     Land        Sows      Land      Totals

Coefficients      .2479     .3075     .0772       .0804     .4547     1.1677
Rate Percents
or Scores           21        26         7           7        39        100


If the counties of Iowa are to be scored on the basis of the 
data in Table 6, it thus appears that 39% of the score should 
be based on the percentage of farm land in corn, 26% on per- 
centage of farm land in small grain, 21% on corn yield in bush- 
els per acre, and 7% each on number of acres of improved 
land per farm and number of brood sows per 1,000 acres. 
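
The rate percents of the score card are simple proportions, as the short Python sketch below illustrates. The dictionary keys are descriptive labels chosen here for convenience. In this problem the rounded percents happen to add to exactly 100; with other data a unit of rounding error may have to be distributed by hand.

```python
betas = {
    "corn yield": 0.2479,
    "small grain": 0.3075,
    "improved land": 0.0772,
    "brood sows": 0.0804,
    "corn land": 0.4547,
}
total = sum(betas.values())
# Each rate percent is the coefficient divided by the sum of the five coefficients.
scores = {name: round(100 * beta / total) for name, beta in betas.items()}
print(round(total, 4), scores, sum(scores.values()))
```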


PART IV. PARTIAL CORRELATION COEFFICIENTS 


In partial correlation coefficients, the attempt is made to de- 
termine the degree of association that would exist between two 
variables if we could eliminate the effects of their common asso- 
ciations with other variables. For example, consider the correla- 
tion coefficient .38 between the number of brood sows per 1,000 
acres (D) and the percentage of farm land in corn (E) in the 
25 counties which have been used as an illustration. 




This is a statistically significant positive correlation, as shown
by the fact that .38 is 2.23 times its own standard deviation. (We
may determine from a table of the probability integral, such as
Pearson’s “Tables for Statisticians and Biometricians”, Table
II; or Pearl and Miner’s table published as Table No. 40 in
Pearl’s “Medical Biometry and Statistics”, the likelihood that
nearly 99% of the correlation ratios calculated from similarly
selected data would be greater than zero.)

The question arises, is r_DE = .38 because of some underlying
relation between these variables, or merely because each of them
is intimately associated with some other variable, such as price
per acre of land (X)? We seek an answer in the “partial
correlation coefficient between E and D independent of X”,
which we shall denote by the symbol r_DE.X. The formula is

\[ r_{DE.X} = \frac{r_{DE} - r_{DX}\,r_{EX}}{\sqrt{1 - r_{DX}^2}\,\sqrt{1 - r_{EX}^2}} \]

In our example

\[ r_{DE.X} = \frac{.38 - .53 \times .86}{\sqrt{1 - (.53)^2}\,\sqrt{1 - (.86)^2}} = -.19 \]


This means that if we could eliminate the common association of 
variables D and E with X, there would actually remain a small 
negative correlation between D and E; that is, independent of 
their common association with land value, there is a very slight 
tendency for large numbers of brood sows per 1,000 acres to be 
associated with small percentages of farm land in corn and vice 
versa. 
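
The formula for the partial correlation coefficient is easily put in the form of a small function. The Python below is a sketch only, and the name `partial_correlation` is hypothetical. Evaluated in floating point with the two-place inputs of the worked example, it returns a value near −.18, a little smaller in magnitude than the printed −.19; the sign and the interpretation are the same.

```python
import math


def partial_correlation(r_12, r_13, r_23):
    """Correlation of variables 1 and 2 independent of variable 3."""
    numerator = r_12 - r_13 * r_23
    denominator = math.sqrt(1.0 - r_13 ** 2) * math.sqrt(1.0 - r_23 ** 2)
    return numerator / denominator


# r_DE, r_DX and r_EX rounded to two places, as in the worked example.
r_de_x = partial_correlation(0.38, 0.53, 0.86)
print(round(r_de_x, 3))
```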

In order to make clear the meaning of the partial correlation 
coefficient, we shall give two explanations as follows: 

First. Imagine the number of counties in our problem in- 
creased to some large number such as a thousand, with no change 
in the simple and partial correlations discussed above. Con- 
sider a group of counties whose land values all lie in some such 
small interval as from $250 to $260 per acre. There might be 25 
or more counties in such a group, and for practical purposes we 
could consider their land values to be the same. We could then 
calculate r_DE for this group, thus determining the degree of
association between the number of brood sows per 1,000 acres
and percentage of land in corn in counties having a common land
value. This could be done for each other small group having
a common land value. Then it may be said that the partial
correlation coefficient r_DE.X would be a kind of average of all
the simple correlation coefficients so obtained. For an illustration
of this, see Pearl’s “Medical Biometry and Statistics”, pp.
322-25.

Second. In the three variable problem considered above, in- 
volving D, E and X, let us consider the estimation first of D from 
X, then of E from X. The two regression equations are written 








_— oD 
D = Mp + rox (X — Mx), 
Cx 
-_ CE 
and E = Ms + Trex (X = Mx), 
Ox 


as explained in Part I. After the estimated values \bar{D} and \bar{E}
are calculated for each value of X, two groups of errors of
estimate can be computed, (D - \bar{D}) and (E - \bar{E}). If these errors
of estimate are arranged in pairs, one pair for each value of X,
then it may be proved that the partial correlation coefficient
r_DE.X is equal to the simple correlation coefficient between these
pairs of errors of estimate. (See Kelley’s “Statistical Method”,
pages 284-287.)
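
The second explanation can be checked numerically. What follows is a minimal sketch in Python, assuming a handful of made-up values for D, E and X (they are not the county data) and helper names of our own choosing. It forms the two series of errors of estimate and compares their simple correlation with the usual first-order formula for a partial correlation coefficient.

```python
from math import sqrt
from statistics import fmean

def corr(u, v):
    """Simple (zero order) correlation coefficient between two lists."""
    mu, mv = fmean(u), fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def errors_of_estimate(y, x):
    """Errors (y - estimated y) from the straight-line regression of y on x."""
    my, mx = fmean(y), fmean(x)
    # slope = r * sigma_y / sigma_x, written as covariance over variance
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def first_order_partial(r_de, r_dx, r_ex):
    """Partial correlation of D and E with X held constant."""
    return (r_de - r_dx * r_ex) / sqrt((1 - r_dx ** 2) * (1 - r_ex ** 2))

# Illustrative values only (assumed for this sketch).
X = [9, 13, 17, 28, 26, 24, 10, 14]
D = [11, 19, 16, 15, 27, 16, 0, 16]
E = [2, 10, 10, 14, 12, 14, 5, 8]

by_errors = corr(errors_of_estimate(D, X), errors_of_estimate(E, X))
by_formula = first_order_partial(corr(D, E), corr(D, X), corr(E, X))
print(round(by_errors, 4), round(by_formula, 4))  # the two values agree
```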

The explanations given above may be extended to partial corre- 
lation coefficients of higher orders. Thus, if we first calculate 
as above 


    r_DE.X = -.19, r_AD.X = .22, r_AE.X = -.13


we may then find the partial correlation coefficient between D 
and E independent of both corn yield per acre (A) and land 
value (X) by means of the formula 


    r_{DE.AX} = \frac{r_{DE.X} - (r_{AD.X})(r_{AE.X})}{\sqrt{1 - r_{AD.X}^2} \, \sqrt{1 - r_{AE.X}^2}}

              = \frac{-.19 - (.22)(-.13)}{\sqrt{1 - (.22)^2} \, \sqrt{1 - (-.13)^2}}

              = -.17
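
The arithmetic of this step is easily reproduced by machine. The following is a small sketch, with a function name of our own choosing, that raises the order of a partial correlation coefficient by one and applies it to the three first-order values quoted above.

```python
from math import sqrt

def next_order_partial(r_de, r_ad, r_ae):
    """Partial correlation of D and E with one more variable (A) held constant.

    All three arguments must be partials of the same lower order,
    that is, with the same variables already held constant.
    """
    return (r_de - r_ad * r_ae) / sqrt((1 - r_ad ** 2) * (1 - r_ae ** 2))

# First-order values from the text: r_DE.X, r_AD.X, r_AE.X
r_de_ax = next_order_partial(-0.19, 0.22, -0.13)
print(round(r_de_ax, 2))  # -0.17
```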


According to the first explanation given above, this means that 
for groups having corn yields per acre the same, as well as land 
values the same, the average simple correlation between brood 
sows per 1,000 acres and percentage of land in corn would be 
negative (-.17), but not highly significant statistically.
According to the second explanation, -.17 is the simple
correlation coefficient between two series of errors of estimate:
the first being the errors of estimate, (D - \bar{D}), made when we
estimate values of D from given values of A and X (Part II); and
the second, (E - \bar{E}), made when we estimate values of E from the
same given values of A and X.

It is obvious from the foregoing that the calculations involved
in multiple correlation, though extensive, are simple in form. A
table giving values of (1 - r^2) or \sqrt{1 - r^2} for various values of
r is highly desirable. John Rice Miner has calculated such tables,
published by the Johns Hopkins Press. The calculations are quickly
performed either with a machine or by means of a slide rule.

In most cases, however, a relatively brief extension of the
calculations described in Part III will yield all the partial regression
coefficients that are desired; namely, those of highest order giving
the correlations between the criterion and the several independent
variables. For example, in our six variable problem,
we may wish the partial correlation coefficient between percentage
of land in corn (E) and land value (X) independent of the
other four variables. The symbol is r_EX.ABCD. Its value can, of
course, be obtained by building up the partials of lower orders
according to the formulas already given, but the quicker method
will now be explained.

If in Table 8b we should interchange the two columns E and
X, as is done in Table 10, and make the necessary re-calculations
in the last block (which is now the X-block), the last number in
column E (line 24) with sign changed (1.0706) is easily seen to
be the partial regression coefficient, β_EX. This is the coefficient
that would be used in the regression equation if we were
considering E as the dependent variable, and estimating E from X
and the remaining four variables. β_XE (calculated in Part III)
and β_EX are called “conjugate regression coefficients”. We may
now use the formula


    r_{EX.ABCD} = \sqrt{\beta_{XE} \times \beta_{EX}}
                = \sqrt{.4547 \times 1.0706}
                = .6977


It has already been explained that the notation used for the
β’s in a six variable problem is quite inadequate. It should be
observed that the complete notation for the above equation is

    r_{EX.ABCD} = \sqrt{\beta_{XE.ABCD} \times \beta_{EX.ABCD}}.
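
Why the product of two conjugate coefficients gives the square of the partial correlation can be sketched as follows. The symbols e_E, e_X, s_E and s_X are introduced here for illustration only and are not the bulletin's notation. The sketch rests on the second explanation of partial correlation given above, together with the standard result that a partial regression coefficient equals the simple regression coefficient between the two series of errors of estimate.

```latex
% e_E, e_X : errors of estimate of E and of X, each estimated from A, B, C, D
% s_E, s_X : standard deviations of those two series of errors
% b        : partial regression coefficients in the original units
\begin{aligned}
b_{XE.ABCD} &= r_{EX.ABCD}\,\frac{s_X}{s_E}, \qquad
b_{EX.ABCD} = r_{EX.ABCD}\,\frac{s_E}{s_X}, \\[4pt]
b_{XE.ABCD}\; b_{EX.ABCD} &= r_{EX.ABCD}^{2}
\quad\Longrightarrow\quad
\left| r_{EX.ABCD} \right| = \sqrt{\,b_{XE.ABCD}\; b_{EX.ABCD}\,}.
\end{aligned}
```

If the coefficients are instead expressed in standard-deviation units, one is multiplied by σ_E/σ_X and the other by σ_X/σ_E, so the product, and hence the formula, is unchanged.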


If we wish to calculate r_DX.ABCE we must have β_DX in addition to
the β_XD previously calculated. To get β_DX we may interchange
columns D and E in Table 10 with the corresponding change in
block letters, and make the necessary re-calculations as in Table
11. The last number now appearing in column D (line 24) with
sign changed (.3511) is β_DX. Then

    r_{DX.ABCE} = \sqrt{.0804 \times .3511} = .1679

Continuing as above, successively interchanging columns and
blocks C and D, B and C and finally A and B from their position
in Table 11 and doing the increasingly greater amount of
calculation each time, we obtain successively β_CX = .2041,
β_BX = .6399 and β_AX = .9822. From these and their previously
calculated conjugates we obtain


    r_CX.ABDE = .1253, r_BX.ACDE = .4436, r_AX.BCDE = .4936


The amount of additional work is not great, and the information 
obtained may be of great importance. 

We reach the conclusion in our illustrative problem that land 
value (X) is more highly correlated with percentage of land in 
corn (E), independent of the associations of E and X with the
other four variables, than it is with any of the other variables 
which we have considered. This is the same conclusion, though 
with somewhat different quantitative relations, that was deduced 
from the partial regression coefficients and the regression equa- 
tion worked out in Part III. 

If the investigator does not care for the simple (zero order) 
correlation coefficients (Table 7), but wishes only the highest 
order partial correlation coefficients, together with the multiple 
regression equation, he may avoid the calculation of the r’s and 
proceed directly to the solution of the normal equations using 
only the product-moments as the necessary data. See Tolley
and Ezekiel, “A Method of Handling Multiple Correlation
Problems”, Quar. Pub. Am. Statistical Asso., Dec. 1923.

We shall close this part by returning to the problem first con- 
sidered; namely, the correlation between number of brood sows 
per 1,000 acres (D) and percentage of land in corn (E); but 
now we shall find what it would be independent of all the other 
variables. The symbol is r_DE.ABCX.

This may be found by continuing the process first described,
calculating in all ten partial correlation coefficients of first order
(such as r_DE.X), six of second order (such as r_DE.AX), three of the
third order (such as r_DE.ABX) and finally the one required.

The alternative is to calculate β_DE and β_ED as in Part III and
take the square root of their product. This is very easily done.
Simply return to Tables 10 and 11 and compute the reverse as
far as line 2 in each table. Then read from reverse line 2, column
D, in Table 10 the result, β_ED = -.0415, and in Table 11
β_DE = -.0763. It is obvious that the amount of new calculation
is trivial. Not only have we obtained the required β’s, but we
also have an illustration of the important fact that two conjugate
regression coefficients such as these must always have the
same sign, either both positive or both negative. Furthermore,
the corresponding partial correlation coefficient takes the same
sign as the two β’s have. Hence, finally,

    r_{DE.ABCX} = -\sqrt{(-.0763)(-.0415)}
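
The rule of signs just stated can be built into the calculation. The sketch below, with a function name of our own choosing, takes a pair of conjugate regression coefficients, rejects a pair of opposite signs, and gives the root the sign of the two β's. Applied to the figures quoted above, it returns .6977 for E and X and about -.056 for D and E.

```python
from math import sqrt

def partial_from_conjugates(beta_1, beta_2):
    """Partial correlation from two conjugate regression coefficients.

    The magnitude is the square root of their product; the sign is the
    sign shared by the two coefficients.
    """
    if beta_1 * beta_2 < 0:
        raise ValueError("conjugate regression coefficients must agree in sign")
    root = sqrt(beta_1 * beta_2)
    return -root if (beta_1 < 0 or beta_2 < 0) else root

r_ex = partial_from_conjugates(0.4547, 1.0706)    # beta_XE, beta_EX
r_de = partial_from_conjugates(-0.0763, -0.0415)  # beta_DE, beta_ED
print(round(r_ex, 4), round(r_de, 4))  # 0.6977 -0.0563
```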


The reader who is interested in partial correlation will find an
excellent discussion by Mordecai Ezekiel in the Journal of Farm
Economics, Vol. 5, No. 4, pp. 198-203, “The Use of Partial
Correlation in the Analysis of Farm Management Data”.

An interesting case of partial correlation is that in which time 
enters as one of the variables. In such ‘‘time series’’, there is 
frequently a high correlation merely because two variables are 
changing with time in some regular way, though they may have 
no conceivable relation to each other. Also, two variables may 
have a certain periodicity which affects their correlation. For 
a discussion of this, see H. L. Rietz, “Handbook of Mathematical
Statistics”, Chapter X.
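
A small invented example may make the point about time series concrete. In the sketch below, which assumes two artificial ten-year series and is not drawn from the bulletin's data, each series is a steady trend plus an unrelated wobble. The simple correlation is high, yet the partial correlation with time held constant vanishes.

```python
from math import sqrt
from statistics import fmean

def corr(u, v):
    """Simple correlation coefficient between two lists."""
    mu, mv = fmean(u), fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

def partial_corr(r_xy, r_xt, r_yt):
    """Correlation of x and y with the third variable t held constant."""
    return (r_xy - r_xt * r_yt) / sqrt((1 - r_xt ** 2) * (1 - r_yt ** 2))

t = list(range(10))                            # time, in years
wobble_x = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]  # invented fluctuations
wobble_y = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1]
x = [ti + w for ti, w in zip(t, wobble_x)]      # rises 1 unit a year
y = [2 * ti + w for ti, w in zip(t, wobble_y)]  # rises 2 units a year

r_xy = corr(x, y)
r_xy_t = partial_corr(r_xy, corr(x, t), corr(y, t))
print(round(r_xy, 2), round(abs(r_xy_t), 6))
```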


PART V. CODING 


Coding, in the method herein presented of preserving the
identity of the individual observations, is the equivalent of the
usual grouping into classes and translating the origin of
measurement. The method is explained below, using as illustration
the same data as hereinbefore. The authors wish to express their
opinion, however, that except for purposes of illustration, coding
is not desirable in handling so small a number of observations.
The time taken in coding more than offsets the time saved in
calculation.


TABLE 12. CODED VALUES

County        A    B    C    D    E    X     S
Allamakee    10   11   21   11    2    9    64
Bremer        6   13   20   19   10   13    81
Butler        4   19   27   16   10   17    93
Calhoun      11   33   32   15   14   28   133
Carroll       9   25   31   27   12   26   130
Cherokee     12   23   33   32   12   27   139
Dallas       10   22   26   16   14   24   112
Davis         1    9   24    0    5   10    49
Fayette       6   13   21   16    8   14    78
Fremont       4   17   27   20   16   20   104
Howard        0   18   27   10    4   12    71
Ida          10   23   37   38   10   27   145
Jefferson     7   14   20   10    8   16    75
Johnson      11   13   24   30    9   15   106
Kossuth       8   24   35   16   10   20   113
Lyon          8   31   36   26   13   28   141
Madison       4   16   25   12    8   18    83
Marshall     15   19   28   20   13   24   118
Monona        4   20   30   16   10   16    96
Pocahontas   10   30   33   15   14   26   128
Polk         11   22   19   10   12   25    99
Story        12   21   26   17   16   28   120
Wapello       5   16   19   12    6   17    75
Warren        3   18   24    9  [?]  [?]    78
Winneshiek    6   18   23   20    6   12    85

Coding is desirable first, if the individual observations on one 
or more of the variables are numbers of more than two digits; 
and second, if the number of observations is so large as to warrant
the use of punched cards with a sorting machine. It should
be distinctly understood that the only purpose of coding is to 
save time. 

The coded values of our variables are given in Table 12. The 
values of the coded A-variable are formed by subtracting 30 bu. 
per acre (the smallest yield in the original list) from each orig- 
inal observation. No change is made in the values of B. Each 
original value of C is divided by 5 and the numbers ‘‘rounded’’ 
in the usual manner. The original D-values are first divided by 
2, then decreased by 10; while the E-values are first decreased 
by 10, and the results divided by 2. The X-values are divided 
by 10. 
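
The several coding rules just described may be set down as follows. This is a sketch only: the function names are ours, the sample inputs are hypothetical rather than taken from the original list, and “rounded in the usual manner” is assumed to mean rounding a half upward, applied to every division.

```python
from math import floor

def round_half_up(value):
    """Round to the nearest whole number, a half going upward (assumed rule)."""
    return floor(value + 0.5)

def code_a(yield_bu):      # subtract 30 bu. per acre
    return yield_bu - 30

def code_b(value):         # no change
    return value

def code_c(value):         # divide by 5, then round
    return round_half_up(value / 5)

def code_d(value):         # divide by 2 first, then decrease by 10
    return round_half_up(value / 2) - 10

def code_e(value):         # decrease by 10 first, then divide by 2
    return round_half_up((value - 10) / 2)

def code_x(dollars):       # divide by 10
    return round_half_up(dollars / 10)

# Hypothetical original observations, for illustration only.
print(code_a(41), code_c(133), code_d(52), code_e(30), code_x(173))
# 11 27 16 10 17
```

Note that D and E differ only in the order of the two operations, which is what governs the order of adjustment later on.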

We wish to emphasize the fact that for purposes of illustrat- 
ing the various ways of coding, we have greatly overdone the 
thing. Ordinarily subtraction should be confined to such easily 
subtracted numbers as 10, 50, 100, etc., while division should be
limited to division by 10, 100, etc.

It is desirable that coded values should all be less than 100, 
and usually they should be less than 50. Coding by addition 
can be used to eliminate negative values of a variable, and coding 
by multiplication can be used to eliminate decimals. 

Coding by subtraction (or addition) does not affect the standard
deviations or the correlation coefficients at all. A coded
mean, however, is less (or greater) than the true mean by exactly
the same amount as the coded value of an observation is
less (or greater) than its true value. Thus, Table 13 shows
that σ_A = 3.70 bu. per acre, just as it is if calculated from true
values, but the coded mean of 7.48 must be increased by 30 to
equal the true mean of 37.48 bu. per acre.
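
These figures can be verified from the coded A-column of Table 12. The sketch below assumes the standard deviation is taken with divisor N (which is what reproduces 3.70) and uses variable names of our own.

```python
from statistics import fmean, pstdev

# Coded A-values from Table 12 (yield in bu. per acre, less 30).
coded_a = [10, 6, 4, 11, 9, 12, 10, 1, 6, 4, 0, 10, 7,
           11, 8, 8, 4, 15, 4, 10, 11, 12, 5, 3, 6]

coded_mean = fmean(coded_a)
true_mean = coded_mean + 30                      # undo the subtraction
sigma_coded = pstdev(coded_a)                    # divisor N
sigma_true = pstdev([a + 30 for a in coded_a])   # same spread after decoding

print(round(coded_mean, 2), round(true_mean, 2), round(sigma_coded, 2))
# 7.48 37.48 3.7
```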

Coding by division (or multiplication) has no effect on correlation
coefficients, but produces a corresponding division (or
multiplication) of both the mean and the standard deviation.
If, however, division is accompanied by a “rounding” of the
resulting coded values, as is practically always the case, small
discrepancies will exist between the true means and standard
deviations and those obtained by adjusting the coded means and
standard deviations. However, if the coding is not radical, the
differences are always small compared to their standard deviations,
and are not, therefore, statistically significant. For example,
coded σ_C = 5.34, giving an adjusted σ_C = 5.34 × 5 = 26.70
acres, which should be compared with the true σ_C = 26.78
± 2.56. Coded M_X = 19.76, adjusted M_X = 10 × 19.76 = $197.60
per acre, and true M_X = $198.20 ± $8.26. (Sheppard’s corrections
may be applied. Kelley: “Statistical Method”, p. 167.)
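
That a change of origin or of scale leaves the correlation coefficient untouched can be seen from the coded A- and C-columns of Table 12. In the sketch below the “decoded” C-values are simply five times the coded ones; because of rounding they are not the true original values, and serve only to illustrate the rescaling. The function name is ours.

```python
from statistics import fmean, pstdev

coded_a = [10, 6, 4, 11, 9, 12, 10, 1, 6, 4, 0, 10, 7,
           11, 8, 8, 4, 15, 4, 10, 11, 12, 5, 3, 6]
coded_c = [21, 20, 27, 32, 31, 33, 26, 24, 21, 27, 27, 37, 20,
           24, 35, 36, 25, 28, 30, 33, 19, 26, 19, 24, 23]

def corr(u, v):
    """Simple correlation coefficient (divisor N throughout)."""
    mu, mv = fmean(u), fmean(v)
    cov = fmean([(a - mu) * (b - mv) for a, b in zip(u, v)])
    return cov / (pstdev(u) * pstdev(v))

r_coded = corr(coded_a, coded_c)
r_decoded = corr([a + 30 for a in coded_a], [5 * c for c in coded_c])
sigma_c_adjusted = 5 * pstdev(coded_c)   # adjust by multiplication only

print(round(r_coded, 2), round(r_decoded, 2), round(sigma_c_adjusted, 2))
# 0.25 0.25 26.7
```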

In cases where both subtraction and division have been used
in coding, means must be adjusted by both multiplication and 
addition, the order being the reverse of that used in coding. 
Standard deviations must be adjusted by multiplication only. 
As a first example, consider variable D. In coding, values of D 
were first divided by 2 and then 10 was subtracted from the 
quotient. In adjusting the coded standard deviation, we merely 
multiply by 2 but in adjusting the coded mean, we must first 
add 10, then multiply by 2. (See Table 13.) As a second 
example, E was coded by first subtracting 10 and then dividing 
by 2. To adjust the mean, therefore, we must first multiply the 
coded mean by 2 and then add 10 to the product. To adjust 
the standard deviation, simply multiply by 2. 
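
The two orders of adjustment may be sketched as below. The function names and the three-value samples are hypothetical, and rounding is left out so that the adjusted figures agree exactly with the true ones.

```python
from statistics import fmean, pstdev

def adjust_mean_d(coded_mean):
    """D was divided by 2, then 10 subtracted: add 10 first, then multiply by 2."""
    return (coded_mean + 10) * 2

def adjust_mean_e(coded_mean):
    """E had 10 subtracted, then was divided by 2: multiply by 2, then add 10."""
    return coded_mean * 2 + 10

def adjust_sigma(coded_sigma, divisor=2):
    """Standard deviations are adjusted by multiplication only."""
    return coded_sigma * divisor

# Hypothetical true values, coded without rounding.
d_true = [40, 52, 64]
e_true = [14, 30, 34]
d_coded = [v / 2 - 10 for v in d_true]
e_coded = [(v - 10) / 2 for v in e_true]

print(adjust_mean_d(fmean(d_coded)), fmean(d_true))   # 52.0 52.0
print(adjust_mean_e(fmean(e_coded)), fmean(e_true))   # 26.0 26.0
```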

Means and standard deviations must be adjusted before use 
in the regression equation. 

The values of coded S are found by adding the several coded 
values of the other variables without any reference whatever to 
the values of S used in Table 6. 

Table 14 exhibits the simple correlation coefficients obtained
by using coded values. In no case is there a significant divergence
from the values given before.


TABLE 14. CORRELATION COEFFICIENTS FROM ORIGINAL DATA
AND FROM CODED DATA

          B             C             D             E             X
     Orig.  Coded  Orig.  Coded  Orig.  Coded  Orig.  Coded  Orig.  Coded
A     .41    .42    .25    .25    .50    .50    .53    .53    .67    .67
B                   .75    .75    .84    .86    .68    .65    .80    .81
C                                 .57    .56    .50    .45    .64    .62
D                                               .38    .39    .53    .54
E                                                             .86    .84


The value of R calculated from these data is .94. The re- 
gression coefficients are contrasted in Table 15. 

The differences are comparatively great, but the statistical sig- 
nificance of the results is little altered, even though the coding 
was intentionally carried to excess. 

TABLE 15. REGRESSION COEFFICIENTS

Partial regression coefficient of X on     A     B     C     D     E
β’s calculated from original data         .20   .31   .08   .08   .45
β’s calculated from coded data            .22   .38   .03   .12   .42





PART VI. PRECAUTIONS AND SUGGESTIONS 


Before any correlation study is undertaken, it is important to 
make a serious effort to think through the nature of the causes 
connecting the variables. Much valuable time and effort are 
wasted by rushing into elaborate calculations before a definite 
plan is formulated. Many students, laboriously working out 
correlation coefficients, feel that their work must have a certain 
virtue simply because they have spent so much time in calculation.
On the other hand, preliminary correlation studies are
often indispensable as a guide to the formulation of the final 
plans even though the latter may not include the correlation 
methods at all. 

Cause and effect cannot be determined by correlation. Two 
variables may be constantly and intimately associated and yet 
have no causal relations whatever. The correlation coefficients 
merely point the way to further study and investigation. 

Utter familiarity with the data is a prerequisite to successful 
deductions. Correlation is not a magic formula. Mere calculation,
no matter how intricate or extensive, can never take the
place of intimate, ‘‘common sense’’ knowledge of the records. 
Only the man who has worked over his material from many 
angles until he has become thoroughly familiar with it can hope 
to apply correlation coefficients and regression lines in a truly 
fruitful way. 

There is a tendency to look upon correlation coefficients as an 
end in themselves. In some cases, the mechanical labor absorbs
so much energy and time that there is very little left for the 
real job of interpretation. In reality, the correlation coefficients 
and related constants are usually just a beginning in any serious 
study. Unless hard thinking and common sense are used in in- 
terpretation, correlation work may do more harm than good. 

Two extremes should be avoided in your attitude toward the 
correlation results. On the one hand do not be discouraged if 
the correlation coefficient is lower than expected, or if the esti- 
mated values of the criterion vary widely from actual. Study 
with the greatest care the cases which deviate most widely. Are 
they due to accidental or unusual circumstances, and can such 
be avoided? Should the relationship be expressed by a curved
regression line rather than by one which is straight? Is it
necessary to include other variables to account for the discrepancies?
Remember, it is not impossible that important discoveries can
be initiated by first learning that expected correlation does not
really exist. On the other hand, do not be too easily satisfied.
It would be a shortsighted policy to stop with a correlation
coefficient of .96 when a more perfect explanation might be readily
apparent after a little further work.

If the number of independent variables is large and the number
of observations relatively small, the multiple correlation
coefficient seems to gather a certain amount of “fictitious
correlation” merely from the multiplication of the number of variables.
B. B. Smith has a correction formula to be used in such cases.
This is expected to appear in the March, 1925, issue of the
Journal of the American Statistical Society.

The formula is 





    (Corrected R^2) = 1 - \frac{1 - R^2}{1 - \frac{M}{N}}

where M is the number of independent variables and N is the
number of observations.
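
The correction is quickly applied by machine. The sketch below uses a function name of our own; as an illustration only, it takes R = .94 from the coded data, with M = 5 independent variables and N = 25 counties.

```python
from math import sqrt

def corrected_r_squared(r, m, n):
    """Smith's correction: 1 - (1 - R^2) / (1 - M/N).

    r: multiple correlation coefficient
    m: number of independent variables
    n: number of observations (must exceed m)
    """
    if not 0 < m < n:
        raise ValueError("need 0 < M < N")
    return 1 - (1 - r ** 2) / (1 - m / n)

r2 = corrected_r_squared(0.94, 5, 25)
print(round(r2, 4), round(sqrt(r2), 2))  # 0.8545 0.92
```

With so few variables relative to the number of counties the correction is slight; it grows rapidly as M approaches N.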

What is the real object of correlation coefficients and their
related concepts? The details vary with the field of investigation,
with the particular problem in hand, and with the mental
peculiarities of the investigator. The purely scientific effort to
determine causal relations, the prediction of market prices,
vocational guidance, educational policies, the correct method of
scoring corn, heredity, land values, the correction of yields for soil
variation; these are some of the problems attacked with correlation
methods. The research worker must always interpret his
results in the light of his own knowledge. After all, correlation
is simply one scheme for discovering and evaluating relationship.